The change to continuous delivery for IMS has had many ramifications, including changes to the way we verify APARs before shipping them. This process has been done the same way for as long as I have been with the company. Every week:
- APARs finish unit test and review, becoming ready for submission to the weekly Life Cycle Test (LCT) process.
- A collection of test cases is executed against all the APARs that have been submitted to LCT.
- A team looks at the failing test cases and determines if the failure is due to a code error or some other issue (timing issues, data set space or catalog issues, missing or invalid environment settings, etc.).
- This team decides whether each APAR should be removed or certified as ready to ship.
I’m not saying there’s anything wrong with this process; in fact, we’ll be keeping the same general format going forward. It’s just not as efficiently implemented as it could be. For example, it takes us a week to run and verify all these test cases. That is no longer acceptable.
Back in the day we had a small dedicated team taking care of this for us…running these test cases manually. I don’t know about you, but that sounds awfully tedious. The good part about it is that they only had between 400 and 500 test cases to get through per release; we didn’t have enough time in the week to run more. In the early to mid-2000s there was a push to modernize the process. We started by adding a tool that would automatically rerun failed test cases; this increased the number of successful results and reduced the number of failures that the LCT team needed to look at and resolve. Next, we added a new scheduling and reporting tool, which allowed us to drastically increase the number of Virtual Machines (VMs) we were using to run test cases. With these changes in place we were able to increase the number of test cases run against each outgoing APAR from 500 to over 5,000. In fact, IMS 15 has over 7,500 test cases executed every week for each batch of APARs!
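The auto-rerun idea can be sketched in a few lines. This is a hypothetical illustration, not the actual IMS tooling: a test that fails is simply retried a few times before being reported, which filters out intermittent failures (timing, environment) automatically. The function name and retry count are assumptions for the example.

```python
def run_with_retries(test_fn, max_attempts=3):
    """Run test_fn until it passes or attempts are exhausted.

    Returns a (status, attempts_used) pair, so intermittent failures
    that pass on a retry never reach the manual-triage queue.
    """
    for attempt in range(1, max_attempts + 1):
        if test_fn():
            return ("PASS", attempt)
    return ("FAIL", max_attempts)
```

A flaky test that fails once and then passes would come back as `("PASS", 2)` instead of landing on the LCT team’s desk.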
We Can Do Better
Traditionally the focus of LCT has been to get every test case to pass; sometimes that’s more work than it should be. We have had a lot of people writing test cases over the years, and not all of them really understood the best ways of doing that (there are even some test cases I wrote that are embarrassing to look at; ah, blissful ignorance). That left us with more than a few problem test cases and even some poorly written tools and common subroutines. All of these issues cause a single headache: manually rerunning failed test cases until they pass. Most of the time the current batch of APARs has nothing to do with a failing test, which means we’re wasting significant time and resources chasing this goal every week.
For the past six months, we have been focusing on two major improvements:
- Reducing the number of test cases that fail for a non-APAR reason
- Speeding up test case execution
I’m excited by the improvements we’re seeing: 60% fewer first-time failures (tests that fail the first time they are attempted) and a 71% improvement in throughput. Last year some of these worklists would never complete in the maximum allotted time of 7 days; now we’re able to get through all of them in 2! We’ve fixed test cases, subroutines, and environment setups; added cleanup and setup routines; removed unnecessary static wait times; and more.
We Can Do Even Better
The next phase is ambitious: replacing the worklist model with something more flexible. Our goal is to execute all 7,500 test cases in one 24-hour period. To do this we will be implementing a new framework that no longer uses worklists. Instead of each VM working through a worklist of test cases, it will ask for its next test case from a shared pool. The problem with the worklist model is that when a VM executes the last test case in its worklist, the VM is out of work and sits idle, even if there are still hundreds of tests it could be executing. We currently have no way of redistributing the work halfway through the cycle. With the new process, a VM keeps asking for test cases until the pool is empty. This gives us two major advantages:
- Each VM is working until we’re done; none is sitting idle.
- We can dynamically add or remove VMs on the fly.
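The pool model described above can be sketched with a shared queue and a set of worker threads standing in for VMs. This is a minimal illustration under assumed names (`run_test_case`, `run_pool` are not the real framework): each worker pulls the next test case from the pool until it is drained, so no worker sits idle while work remains.

```python
import queue
import threading

def run_test_case(name):
    # Stand-in for actually executing one test case on a VM.
    return f"{name}: PASS"

def worker(pool, results):
    # Keep asking the pool for work until it is empty, then stop.
    while True:
        try:
            case = pool.get_nowait()
        except queue.Empty:
            return  # pool drained: this "VM" is done
        results.append(run_test_case(case))

def run_pool(test_cases, num_vms=4):
    # Load every test case into one shared pool.
    pool = queue.Queue()
    for case in test_cases:
        pool.put(case)
    results = []
    # Workers can be added or removed here without repartitioning work,
    # which is what makes the pool model more flexible than worklists.
    threads = [threading.Thread(target=worker, args=(pool, results))
               for _ in range(num_vms)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    done = run_pool([f"TC{i:04d}" for i in range(100)], num_vms=8)
    print(len(done))
```

Because work is claimed one case at a time, a fast worker naturally takes on more cases than a slow one, and scaling the VM count is just a matter of changing `num_vms`.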
When we get this working, we expect to hit the 24-hour mark and perhaps even beat it.
How fantastic would it be to put an APAR into LCT and know within a day whether it is ready to ship? That is something no one would have thought possible two years ago, and we’re now on the cusp of making it a reality. In the near future we will be looking at more improvements: LCT on demand and relevant test case identification, to name a few. I’m excited to see where we’re going. 2018 will be an adventure that’s well worth taking! If you have questions or comments about the IMS Lifecycle Test process, drop them here!