How we overcame performance nightmares in our monolith app
Steps for performance improvement
Subscriber and Subscription Management (SSM) is the system that funnels orders for IBM SaaS offerings sold through IBM and third-party marketplaces to the appropriate endpoints. SSM provisions orders for customers and manages their entire subscriber and subscription lifecycle. It handles about 2,000 requests per hour.
SSM is a legacy monolith app, and dealing with such a mission-critical application with millions of lines of code can be a nightmare. Making it more complex is the transaction handling implemented at every smallest service-layer unit. To support high-end business use cases, SSM exposes dozens of composite APIs. These composite APIs internally call the smallest-unit APIs, so a single composite API request holds multiple DB connections at once.
This eventually exhausted the DB memory, losing myriad live transactions. You might be asking:
- Can’t transaction handling be implemented at the composite API level rather than at the smallest API unit? No, because the data-access layer is tightly coupled with the lower-level APIs, and moving transactions to a higher level would introduce many stale-object-state exception cases.
- Can’t the monolith be broken down into a microservices architecture, the current market trend? No, because that is a costly affair in terms of resources and time; moreover, developers were busy tending to the issue above, leaving no room to think about and invest time in this approach.
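The connection pile-up described above can be sketched minimally. The unit names and counts below are hypothetical, chosen only to show how per-unit transaction handling lets one composite request check out several pooled connections before any are returned; this is not SSM's actual code.

```java
// Hypothetical sketch: transaction handling lives in each smallest-unit API,
// so a composite request accumulates connections as it fans out.
public class CompositeApiSketch {
    static int inUse = 0;  // connections currently checked out of the "pool"
    static int peak = 0;   // highest number held at any one time

    // Each smallest-unit API begins its own transaction and acquires a connection.
    static void beginUnitTransaction() {
        inUse++;
        peak = Math.max(peak, inUse);
    }

    static void endUnitTransaction() {
        inUse--;
    }

    // A composite API whose nested unit calls all hold connections until the
    // whole request completes (unit names are illustrative).
    static void compositeApi() {
        beginUnitTransaction(); // e.g. fetch user
        beginUnitTransaction(); // e.g. fetch roles
        beginUnitTransaction(); // e.g. fetch entitlements
        // ... composite business logic runs while all three are still held ...
        endUnitTransaction();
        endUnitTransaction();
        endUnitTransaction();
    }

    public static void main(String[] args) {
        compositeApi();
        System.out.println("peak connections held by one composite request: " + peak);
    }
}
```

With dozens of such composite APIs running concurrently, the pool drains far faster than the per-hour request rate alone would suggest.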
It was critical to find a fast and efficient solution to this problem, since it impacted the business. Making things worse, with SSM at the core of the marketplace ordering flow, both upstream and downstream systems were significantly impacted. It was also difficult to identify whether the source of the problem was at the code, database, or infrastructure layer (the application is deployed on IBM Cloud). With the team's engineering skills and aggressive debugging, we worked through the issue in the three phases below.
Pattern Discovery Phase
We analyzed the historical performance issues using an internal monitoring tool. This helped us identify that a huge number of calls were being made to fetch users with many roles or associated entitlements, causing the application to consume more resources and ultimately delaying subsequent API calls. This was a progressive effort, achieved through:
- Grouping the specific APIs in the monitoring tool that caused additional load to the application.
- Taking a snapshot of historic data, enabling us to find the pattern that caused the performance degradation.
- Creating similar API sets to run in an SSM preproduction environment.
Problem Reproduction Phase
Performance load tests were run in an SSM preproduction environment over a few weeks at different times of the day, and heap dumps were collected for every run. Collecting the heap dumps for analysis was a bottleneck; the solution was to signal the main Java process to produce a dump and then copy the dump to a local machine for debugging. Steps to collect the heap dump from the IBM Cloud environment:
```shell
ibmcloud target --cf -sso    # target Cloud Foundry via SSO
ibmcloud cf apps             # list the running cf apps
ibmcloud cf ssh <appname>    # open an SSH session into the app container
ps aux                       # get the process ID of the running Java process
```
We then sent the process a `kill -3` (SIGQUIT), which makes the JVM write a javacore dump without terminating the process. Do not use `kill -9`: it kills the process outright and produces no dump. Once the above commands are fired, you will notice the core dump under the following folder:
```shell
vcap@27854948-c2e2-4bc8-7649-c266:~$ ls -ltr /home/vcap/app/
total 5840
drwxr-xr-x 4 vcap vcap      62 Jul  5 09:50 WEB-INF
drwxr-xr-x 3 vcap vcap      38 Jul  5 09:50 META-INF
drwxr-xr-x 2 vcap vcap      26 Jul  5 09:50 jsp
-rw-r----- 1 vcap vcap 5979538 Jul  5 12:40 javacore.20210705.124041.16.0001.txt
```
You can generate as many core dumps as you want (depending on the investigation).
Next, we copied the remote core dump to a local laptop by redirecting the remote file's contents into a local file:

```shell
ibmcloud cf ssh <appname> -c "cat <path of core dump>" > <local path>
```
After a couple of executions, the same scenario was reproduced, which gave us confidence that the investigation was on the right track. It was a daunting task to simulate it over and over again during peak times.
Problem Analysis Phase
With a few dumps in hand, the REST (POST) calls were analyzed in depth. This gave insights into the degraded application behavior: the GET calls were holding their DB connections even after the result set had been fetched, while other incoming requests waited for connections to be released. This starved the pool, effectively deadlocking the waiting requests, pushing the overall app into degraded-performance mode during high-traffic periods and eventually causing a crash. As the following frame from the javacore shows, 75 threads were awaiting a connection from the pool:

```
at com/mchange/v2/resourcepool/BasicResourcePool.awaitAvailable(BasicResourcePool.java:1503(Compiled Code))
```
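The waiting behavior can be reproduced in miniature with a bounded pool. This sketch uses a `java.util.concurrent.Semaphore` as a stand-in for the c3p0 pool and assumes nothing about SSM's code; the pool size and timeout are arbitrary illustration values.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PoolStarvationSketch {
    public static void main(String[] args) throws InterruptedException {
        Semaphore pool = new Semaphore(2);  // a tiny pool with 2 connections

        // Two "GET" handlers check out connections and keep holding them
        // after their result sets have already been fetched.
        pool.acquire();
        pool.acquire();

        // A third request now blocks waiting on the pool, just as the 75
        // threads were parked in BasicResourcePool.awaitAvailable.
        boolean acquired = pool.tryAcquire(100, TimeUnit.MILLISECONDS);
        System.out.println("third request got a connection: " + acquired);  // prints false
    }
}
```

Once enough handlers hold connections past their useful lifetime, every new request joins the queue and throughput collapses, which matches the degraded mode seen in production.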
Based on the analysis, the commit mechanism of the GET calls was changed from autoCommit = false to autoCommit = true. This releases the connection as soon as the result set is fetched, rather than holding it until the end of the transaction.
We also fine-tuned the DB connection pool to optimize the connections between the application and the data layer. We increased hibernate.c3p0.max_size from 125 to 250 to allow more DB connections in the pool, and reduced hibernate.c3p0.idle_test_period from 120 to 60 (the interval, in seconds, at which idle connections in the pool are tested for validity).
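The pool tuning can be expressed as a Hibernate properties fragment. This is an illustrative reconstruction using the standard Hibernate c3p0 property keys and the values given above, not a copy of SSM's actual configuration (and the autoCommit change was made per GET call in code, not globally here).

```
# Illustrative reconstruction of the pool tuning described above
# (standard Hibernate c3p0 keys; values from this article)

# Raised from 125: allow more simultaneous DB connections in the pool
hibernate.c3p0.max_size=250

# Lowered from 120: test idle connections for validity more often (seconds)
hibernate.c3p0.idle_test_period=60
```

Raising max_size adds headroom under burst load, while the shorter idle-test period weeds out stale connections sooner; both changes only help once the autoCommit fix stops connections from being held past the fetch.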
The combined approach above resulted in ~80% improvement in the response time for all APIs.
The performance improvement was beneficial and had a positive impact on the API consumers. The journey was hard, but the discovery and learning made both the application and the team more resilient.
Thank you to Anil Sharma for the analysis on the database and Bhakta for sharing expertise on heap dumps. And special thanks to Nalini V. for guiding us in this journey.