I recently traveled to help an enterprise client design and build a CI/CD pipeline as part of their migration to DevOps. The project got me thinking: there are abundant resources on both Continuous Integration and Continuous Delivery, but not much on what actually changes in enterprise environments. How can CI and CD still fail in the enterprise? How can we anticipate failures and fix them?
Here’s how CI/CD usually fails
Anyone who looks into how to build a CI/CD system has likely found tons of information about how to design it, but little about what comes next. The major problems that I have faced in real-world deployments all revolved around failures. Some failures don’t get caught until late in the process (a few don’t get caught until after they are deployed in production), and sometimes overcompensation leads to false failures.
Unfortunately, both of these issues take time to resolve, but this is time well spent. In the worst possible case, something makes it into production before being found. When that happens, it is imperative that a test be written at the same time as the fix. Having this test allows verification of the fix and prevents the same issue from happening again.
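To make the "test alongside the fix" idea concrete, here is a minimal sketch. The helper `parse_vlan_range` and its off-by-one bug are hypothetical, invented only for illustration; the point is that the test encodes the exact input that failed, so the fix can be verified and the bug cannot quietly return.

```python
import unittest


def parse_vlan_range(spec: str) -> list:
    """Hypothetical helper: expand "10-12" into [10, 11, 12]."""
    start, end = (int(part) for part in spec.split("-"))
    # The (illustrative) fix: the range is inclusive, so the upper bound needs + 1.
    return list(range(start, end + 1))


class VlanRangeRegressionTest(unittest.TestCase):
    def test_last_vlan_is_included(self):
        # The input that exposed the bug: the final VLAN used to be dropped.
        self.assertEqual(parse_vlan_range("10-12"), [10, 11, 12])

    def test_single_vlan_range(self):
        self.assertEqual(parse_vlan_range("7-7"), [7])
```

Running this with `python -m unittest` before and after the fix shows the test failing first and passing afterward, which is the evidence that the change did what was intended.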
A few years ago, I was working on a decently sized OpenStack deployment and a breaking network change made it into production. While trying to fix the issue, we actually made the situation worse because there was no test written and we kept making unnecessary changes. Once we stopped and wrote a test to make sure the behavior was not being reproduced, we found that it was a problem with the way the change was being rolled out. One line change later in Ansible, and everything was fixed. It took a combined team of Network and CI/CD specialists over a week to resolve because we kept thinking it was fixed without having appropriate test results. And all because we thought we were saving time.
Once you start to implement robust tests, you will probably start seeing false failures. That is okay; it’s just what happens when building CI. Overcompensation happens when we get zealous about our testing, but a false failure is easier to fix than a breaking change that reaches production.
Once a test exists, it’s imperative to run it as early in the pipeline as possible. I don’t like the “shift left” expression (a common term used to explain that testing should happen as early as possible in a pipeline), because it gives the impression that testing has to be linear. Most CI platforms allow concurrent testing, and in many cases parallelism is a good idea: wherever possible, run steps at the same time rather than in sequence. Time spent waiting on a pipeline is still time spent, so save it by having tests fail as quickly as possible. If a test could run in CI but doesn’t run until CD, time is wasted.
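Most CI platforms express parallel steps in their own configuration, but the idea can be sketched in plain Python. The check names and the `slow_check` stand-in below are assumptions for illustration; real checks would call a linter, a test runner, and so on.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run independent checks at the same time and collect pass/fail results."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(check) for name, check in checks.items()}
        return {name: future.result() for name, future in futures.items()}


def slow_check(seconds: float, passed: bool) -> Callable[[], bool]:
    """Build a stand-in check that sleeps, then reports the given result."""
    def check() -> bool:
        time.sleep(seconds)
        return passed
    return check


# Hypothetical stages; each one takes 0.2 seconds.
checks = {
    "lint": slow_check(0.2, True),
    "unit": slow_check(0.2, True),
    "docs": slow_check(0.2, False),
}
started = time.monotonic()
results = run_checks(checks)
elapsed = time.monotonic() - started  # close to the slowest check, not the sum
```

Run in sequence, these three checks would take about 0.6 seconds; run together, the wall-clock time is roughly that of the slowest one, and the failing `docs` check is reported just as soon.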
When a test catches a failure but only issues a warning, who is going to notice? I certainly don’t have time to read the log from every CI run. Failing hard means that when a failure is found, the build does not pass. Remember, a failure is a failure. End of story. False failures are better than missed failures because they are usually easier to detect.
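One way to fail hard is to make the exit status of a CI step depend on every finding, warnings included. The `Finding` structure and severity labels below are illustrative assumptions, not the output format of any particular tool.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Finding:
    """One problem reported by a check (field names are illustrative)."""
    check: str
    severity: str  # e.g. "warning" or "error"
    message: str


def exit_code(findings: Iterable[Finding]) -> int:
    """Return 1 if any finding exists, regardless of severity, else 0."""
    findings = list(findings)
    for finding in findings:
        print(f"[{finding.severity}] {finding.check}: {finding.message}")
    return 1 if findings else 0
```

A CI step would end with `sys.exit(exit_code(findings))`. CI platforms generally treat a non-zero exit status as a failed step, so a lone warning stops the build instead of scrolling past in a log nobody reads.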
DevOps in an enterprise environment
Enterprise environments face the same challenges as other deployments, but bring their own unique challenges. One of the biggest hurdles is the waterfall mentality (a rigid step-by-step process of engineering). Even though it’s 2018, many of these deployments have been around in some form for a decade or two (I’ve seen a few that have a century of lineage). That’s a lot of process to unravel before a true DevOps mindset (a more fluid way of combining developers, operations, and quality assurance) can take hold.
While the cultural shift to DevOps in the enterprise is possible (and arguably a good move), there are a few things that can’t change. Many companies have customer service level agreements (SLAs) to consider. Some organizations (such as financial, healthcare, and government) have specific auditing requirements that might surpass the abilities of common monitoring options (such as an ELK stack, Prometheus, or Sensu). Then you have other organizations that maintain legacy software that seems impossible to update.
The first step is to figure out what your organization has and what its requirements are. Once you have a clear list of the requirements, and what existing assets can be called on to get your new CI/CD system off the ground, there are a couple of steps that will help developers get onboard.
Start with some basic tests
If your codebase is not already using a solid set of tests, there are many things that you’ll need to address. First, you need a suite of basic tests. Begin by looking for linters (a linter helps find errors in code before it is ever run) and unit tests (unit tests run against small portions of code, such as individual functions, to check that they behave the way you expect). If there is a list of tests that can be run on the codebase, you’re off to a great start. Otherwise, try to focus on building basic tests that can easily be adapted for future needs. Take a look at my blog on the standardized CI system that we use for our patterns for some help on getting started.
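To show what "finding errors before the code is ever run" means, here is a deliberately tiny stand-in for a linter. It only parses the source with Python’s standard `ast` module and reports syntax errors; a real linter checks far more, and the function name `syntax_errors` is my own.

```python
import ast
from typing import List


def syntax_errors(source: str, filename: str = "<source>") -> List[str]:
    """Return syntax problems found by parsing, without executing the code."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return [f"{filename}:{exc.lineno}: {exc.msg}"]
    return []


good = syntax_errors("total = 1 + 2\n", "good.py")
bad = syntax_errors("def broken(:\n    pass\n", "bad.py")
```

Here `good` is an empty list, while `bad` holds one message pointing at line 1 of `bad.py`. Neither snippet was executed, which is what makes this kind of check cheap enough to run first.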
Create a development sandbox
The second thing you need is an easy way to create self-contained testing environments, known as development sandboxes. Most importantly, you must be able to quickly create a new, clean environment in a repeatable way. If you are deploying a service, this means creating a new virtual machine or container at will. If you’re building a Python app, you may only need to use a new virtual environment with a requirements.txt specifying the exact version of everything that is needed to run your app. Whatever it is you need, keep it as close to your production setup as possible to limit possible false failures and false successes. You can get away with some small changes, but having the sandbox as close to production as possible makes tests more reliable.
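For the Python case, a repeatable sandbox can be sketched with the standard library’s `venv` module. The directory and file names are placeholders, and `create_sandbox` is a hypothetical helper rather than part of any tool mentioned here.

```python
import subprocess
import sys
import venv
from pathlib import Path
from typing import List


def sandbox_python(env_dir: Path) -> Path:
    """Path to the interpreter inside the environment (Windows layout differs)."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_command(env_dir: Path, requirements: Path) -> List[str]:
    """Build the pip command that installs pinned dependencies into the sandbox."""
    return [str(sandbox_python(env_dir)), "-m", "pip", "install", "-r", str(requirements)]


def create_sandbox(env_dir: Path, requirements: Path) -> None:
    """Recreate env_dir from scratch, then install the pinned requirements."""
    # clear=True deletes any existing contents, so every sandbox starts clean.
    venv.create(env_dir, clear=True, with_pip=True)
    subprocess.run(install_command(env_dir, requirements), check=True)
```

Calling `create_sandbox(Path("sandbox"), Path("requirements.txt"))` gives every developer the same starting point, provided the requirements file pins exact versions. The `check=True` argument makes a failed install raise an error instead of leaving a half-built environment that looks usable.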
As someone who likes to travel, and may or may not have access to the corporate network, I suggest that your development sandbox be runnable on local machines. Having a Vagrantfile that lets me create a new sandbox whenever I want is much more useful than having to connect to vSphere somewhere. Either way, if a developer has to request a new sandbox, the process will not scale. I can’t stress enough the importance of all developers being able to create a new sandbox whenever they need.
As a bonus of automating sandbox creation, we can run that same automation as the final step in our CI. Now we have linters, unit testing, and a basic deployment check. Awesome! While some of the developers work on updating the existing codebase, we can move on to our other concerns for bringing our enterprise’s workflows into the DevOps world.
The CD side
When you deploy at scale, having a dedicated testing environment is super helpful. The name of the environment doesn’t really interest me (I tend to name them “test”), but some common ones are UAT, PSR, QA, Staging, pre-prod, and more. While the name doesn’t matter to me, what happens in it does.
First off, all changes need to be made in code. If it’s not in source control (git, svn, cvs, wherever), it doesn’t happen. Ever. Being in source control means that it has passed CI and that someone looked at it. Being in code means that it is (in theory, at least) repeatable.
After a deployment happens in your testing environment, a series of tests needs to occur. Smoke tests are helpful to make sure that key functionality is behaving as expected. After smoke testing passes, you can run performance, scale, and reliability checks to help make sure the deployment will handle load.
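A smoke test can be as simple as requesting a few key pages and checking the status code. This sketch assumes an HTTP service; the base URL and paths in the usage note are hypothetical and would be replaced with whatever counts as key functionality for your deployment.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with a 4xx or 5xx status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, and similar


def smoke_test(
    base_url: str,
    paths: List[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each path to True if it answered with HTTP 200."""
    return {path: fetch(base_url + path) == 200 for path in paths}
```

For example, `smoke_test("https://test.example.com", ["/health", "/login"])` returns a pass/fail result per path. The `fetch` parameter exists so the function can be exercised with a fake in unit tests; in the pipeline, any `False` value should fail the deployment step before the slower performance, scale, and reliability checks begin.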
In some high-pressure deployments, it might be necessary to have two testing environments for an extra buffer for testing and human verification. As long as the second environment has a different name than the first, I’m still not interested in what it’s called. What is important is that the first environment has smoke tests and PSR (performance, scale, and reliability) tests, and that the second at least has smoke tests before anything else happens.
If you think you are going to need a second testing environment, stop and weigh it out. I have used them with great success in high-pressure situations, when compliance required that a human approve all production deployments. But there is a heavy cost. Remember that more testing environments mean that code changes take longer to reach production, even when something goes wrong. Having more environments also means that there has to be funding for more infrastructure.
If you only take three things away from this protracted blog, make them these: the time commitment to build and enhance CI/CD is critical, all developers need access to all CI tests and a way to deploy locally, and when in doubt, fail hard and fail fast. Those three go a long way toward addressing most of the common problems with CI/CD.