In this case study we look at an investigation into a financial order management system (OMS) used by investment managers. The system is critical in the management of orders and needs to deliver up-to-the-minute information on order status and positions held.
When performing certain actions on a trade, the trader’s system would hang for a few seconds, and occasionally the hang would last up to two minutes. The problem was very intermittent; in our capture scenario it occurred only after two weeks without problems.
Although the application has few components, it’s quite complex. A fat client .NET application interacts with a Microsoft SQL Server database. A small piece of application code runs on the database server, and this interacts with the application server which, in turn, interacts with the database instance.
Most of the interactions are between the fat client application running on the user’s PC and the database, although occasionally the client application would interact with the application server.
When we investigate a problem, our first objective is to prove which technology is causing it. We do this by producing irrefutable diagnostic evidence of the cause. Theories and speculation immediately stop, and the focus shifts to discussions with the correct vendor or in-house support team.
We structure our investigations with a proven data-driven method called Rapid Problem Resolution (RPR®) that quickly delivers reliable results.
Key steps involved are:
- Gain an understanding of the specific problem symptom
- Gain an understanding of the end-to-end system
- Devise a Diagnostic Capture Plan (DCP) to capture the diagnostic data needed
- Execute the DCP
- Translate and analyse the data
- Identify the root cause
- Translate the diagnostic data into evidence, ready to implement a fix
RPR® PROCESS MAP
Rapid Problem Resolution (RPR) is a data-driven diagnosis method developed by Advance7 in the 1990s and refined over the intervening years.
Two tools were used to capture the data for this problem: Dumpcap for network traces, and a process dump capture mechanism triggered when the application hung.
Dumpcap was installed on a user’s PC and configured to record continuously into a ring buffer of files. At peak times, the data rate was a manageable 400 MBytes per minute, giving an aggregate network load of around 53 Mbps.
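The two figures quoted above are consistent with each other; a quick sanity check (assuming decimal megabytes, i.e. 1 MB = 10⁶ bytes):

```python
# Convert the observed capture rate (400 MBytes/minute) to network load in Mbps.
MB_PER_MINUTE = 400

bits_per_minute = MB_PER_MINUTE * 1_000_000 * 8   # bytes -> bits
mbps = bits_per_minute / 60 / 1_000_000           # per-second, in megabits

print(f"{mbps:.1f} Mbps")  # → 53.3 Mbps
```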
TOP TIP: We created two Windows Batch Command scripts to start and stop the diagnostic data capture mechanisms. These scripts started continuous network captures and automated the dump capture by detecting an application hang. This avoided the need for a user to capture data manually. Full details of the .bat files are available here.
After some time, the problem occurred and the IT team did a great job of collecting matching dump and network trace data. The analysis strategy was:
- Analyse the state of the main thread running the OMS application user interface (UI) to determine how it was being blocked.
- If it was being blocked by an external service, look at the network traces to determine if the problem was a network error, or a problem with the service itself.
TOP TIP: On the face of it, we might think that there is only one network service in use: the database. However, we needed to cover the possibility of an infrastructure service problem, such as DNS or LDAP.
We started with some basic analysis of the collected dump in WinDbg, looking at the state of the UI thread. Because this is a .NET application, interpreting a thread stack is not simple: .NET has its own stack structure, and there are issues around locating JIT-compiled code in memory. WinDbg provides an extension called Son of Strike (SOS) to help. This worked well, and we could have continued the analysis in WinDbg.
TOP TIP: Visual Studio is very good at .NET dump analysis and is simpler to use than WinDbg, so we switched to it.
DEEP DIVE - DUMP ANALYSIS
A thread stack dump shows function calls (sometimes referred to as method calls). As each function is called, a return address pointing back into the caller is pushed onto the stack, giving us a path through the code. To determine the sequence of calls, read the stack from bottom to top. The following is the top of the stack for the UI thread.
System.Data.dll!Tcp::ReadSync(class SNI_Packet * *,int)
[Managed to Native Transition]
oms!oms.SQLSupportVB.SafeOpenNet( ..., string sSQL = "SELECT a.dfMidPrice AS ...
Reading from the bottom, the third entry shows that we are waiting on a synchronous TCP read (Tcp::ReadSync). Any synchronous operation causes the thread to stop executing until the operation completes. The module named System.Data.dll is part of the .NET Framework. Scanning down the stack a little further, we notice that the function calls are prefixed with System.Data.SqlClient, including a call to the method System.Data.SqlClient.SqlCommand.ExecuteReader. The stack entry for the calling method (oms.SQLSupportVB.SafeOpenNet) shows full details of the SQL query being executed.
Therefore, the main (UI) thread is blocked waiting for a SQL query to complete.
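The effect of a synchronous read can be illustrated outside .NET. In this hedged Python sketch (the socket, the "42 rows" payload and the timings are invented for illustration, not taken from the OMS), a thread issuing a blocking receive can do nothing else until the peer responds, exactly as the UI thread was stuck in Tcp::ReadSync until the SQL result arrived:

```python
import socket
import threading
import time

result = {}

def ui_thread(conn):
    # Simulates the OMS UI thread issuing a synchronous read: recv() does
    # not return until data arrives, so the thread can do nothing else.
    result["data"] = conn.recv(1024)

server, client = socket.socketpair()
t = threading.Thread(target=ui_thread, args=(client,))
t.start()

time.sleep(0.5)                     # stand-in for a slow SQL query
blocked = t.is_alive()              # True: still stuck in recv()
print("UI thread blocked:", blocked)

server.sendall(b"42 rows")          # the "database" finally responds
t.join()
print("query result arrived:", result["data"])
```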
DEEP DIVE - TRACE ANALYSIS
The finding from the process dump analysis was cross-checked against the matching dumpcap network packet trace. The chart above shows the SQL response times determined from the dumpcap capture using Wireshark TRANSUM. We see many SQL transactions for this user taking 2 to 5 seconds around the time of the problem. As these operations are being executed on the UI thread, and this thread was being repeatedly blocked, the user experiences application hangs; the application window being totally unresponsive.
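TRANSUM pairs each request with its response and reports the elapsed time between them. The core calculation can be sketched from raw packet timestamps; the packet list below is invented for illustration and is not taken from the actual capture:

```python
# Each tuple: (timestamp in seconds, direction, description).
# Response time here = first response packet - last request packet.
packets = [
    (10.000, "req",  "SELECT query sent to SQL Server"),
    (13.450, "resp", "first packet of the result set"),
    (13.620, "resp", "remaining rows"),
]

last_request = max(t for t, d, _ in packets if d == "req")
first_response = min(t for t, d, _ in packets if d == "resp")
response_time = first_response - last_request

print(f"SQL response time: {response_time:.2f} s")  # → 3.45 s
```

A 3.45-second result like this would sit squarely in the 2-to-5-second band seen in the chart.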
The first thing to understand is that the client is not actually hanging. Users perceive a hang because the UI is unresponsive, and that is because the STA (main) thread in a Windows desktop application is responsible for handling Windows messages. These messages drive UI refreshes and the handling of user input. If the main thread is busy doing other things, the refresh messages queue up until they can be processed.
Unfortunately, OMS appears to have been coded to run data processing operations on the main thread, which gives users the appearance of a hang.
During the ‘hang’ we could see multiple database calls, each taking 2 to 5 seconds.
We provided details of the problem database queries to the customer’s Database Administrator, who identified that the delays were caused by contention on temporary tables.
We also provided the detailed evidence related to the application design (specifically about moving potentially slow operations off the main UI thread) and the customer is now pursuing the problem with the application vendor.