The bank has a system that processes performance data for financial assets held by the bank and by its clients to produce critical start-of-day data for downstream systems. The overnight operations teams were having increasing problems with these jobs: they would stop mid-execution, leaving all remaining jobs held in a queue waiting to run. The workaround was to restart groups of servers, but this took time, and the resulting delays threatened the on-time delivery of the start-of-day data.
We were asked to help because the problem had been ongoing for several months and was causing a high-priority incident most nights.
The system had been designed and built to support high throughput. The C++ application was hosted on a commercial high-performance application server running on Red Hat Enterprise Linux, with 200 worker threads executing jobs in parallel. The data layer was based on Oracle’s Exadata and Advanced Queuing technologies.
INVESTIGATION & FINDINGS
Advance7 SREs use a data-driven problem diagnosis method called RPR. After gaining an understanding of the symptom and the environment, the SRE creates a Diagnostic Capture Plan (DCP) that describes the diagnostic objective, and how to get the data we need to determine the root cause of the problem.
The first DCP defined a technique to passively trace the operation of the system using data sources that were already in place. From the data collected, the SRE quickly determined that a hang in the application was causing the problem, rather than the database issue that had been suspected earlier.
The SRE produced a second DCP, based on collecting memory dumps and other information at the time of the hang. To simplify this task and avoid mistakes, the Advance7 SRE team produced a simple script that collected the dumps together with ps and netstat output.
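A collection script along these lines might look like the following Python sketch. It is not the script the team used: the function names, the file naming scheme and the chosen ps columns are all assumptions made for illustration. It also assumes that gcore (shipped with gdb) is installed and that the script runs with enough privilege to dump the target process.

```python
import subprocess
from pathlib import Path


def build_capture_plan(pid, out_dir, stamp):
    """Return (output file, command) pairs for one capture of process `pid`.

    Hypothetical helper: the naming scheme and command options are
    illustrative assumptions, not taken from the original script.
    """
    base = Path(out_dir) / f"hang-{pid}-{stamp}"
    return [
        # Per-thread state: which threads exist and what each is waiting on.
        (f"{base}.ps.txt",
         ["ps", "-L", "-p", str(pid), "-o", "pid,tid,stat,wchan:32,comm"]),
        # Socket state, e.g. connections to the database tier.
        (f"{base}.netstat.txt", ["netstat", "-anp"]),
        # Memory dump; gcore itself writes the core file as <prefix>.<pid>.
        (f"{base}.gcore.log", ["gcore", "-o", f"{base}.core", str(pid)]),
    ]


def run_capture(pid, out_dir, stamp):
    """Run every command in the plan, saving its output next to the dump."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for out_file, cmd in build_capture_plan(pid, out_dir, stamp):
        with open(out_file, "w") as fh:
            # check=False: keep collecting even if one command fails.
            subprocess.run(cmd, stdout=fh, stderr=subprocess.STDOUT, check=False)
```

Calling run_capture with the process ID of the hung application, an output directory and a timestamp string would write the ps and netstat output, the gcore log and the core file under one common prefix, so that each capture set stays together for later analysis.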
Three sets of dump data were successfully collected and analysed, and they revealed a memory corruption problem. In each case, the program stack for two threads had been overwritten. Unfortunately, one of the affected threads was holding a mutex lock on a critical resource, and that lock was never released, which explains the application hang.
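The way a single stalled lock holder can stop every other worker can be illustrated with a small Python sketch. This is an illustration of the general mechanism only, not the bank's C++ code: the names are hypothetical, the stalled thread is simulated by waiting on an event that is never set, and a short timeout stands in for what would be an indefinite wait in the real system.

```python
import threading


def demo_stalled_lock_holder(n_workers=4, timeout=0.2):
    """Return, per worker, whether it managed to take the shared lock."""
    resource_lock = threading.Lock()
    holder_has_lock = threading.Event()
    never_set = threading.Event()

    def stalled_holder():
        resource_lock.acquire()
        holder_has_lock.set()
        # Stand-in for a thread that can no longer make progress:
        # it never reaches the code that would release the lock.
        never_set.wait()

    def worker(results, index):
        # In the real system this wait would have no timeout at all.
        got_lock = resource_lock.acquire(timeout=timeout)
        results[index] = got_lock
        if got_lock:
            resource_lock.release()

    # Daemon thread so the stalled holder does not keep the process alive.
    threading.Thread(target=stalled_holder, daemon=True).start()
    holder_has_lock.wait()

    results = [None] * n_workers
    workers = [threading.Thread(target=worker, args=(results, i))
               for i in range(n_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return results
```

Every worker in the sketch fails to obtain the lock, so the function returns a list containing only False. Without the timeout the workers would simply block, which matches the observed symptom of jobs stopping mid-execution while the remainder wait in the queue.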
Advance7 SREs provided full details of the problem to the development team, together with details of tools that could help them identify the cause of the memory corruption. Senior Subject Matter Experts and IT managers were no longer required to attend daily conference calls, and so were freed to get on with other important work.
Using the information we provided, the development team identified the cause of the memory corruption and are now working on a fix.