Our financial services customer has a document management application which supports hundreds of internal users. They were experiencing problems causing the application to freeze.
The customer was able to reproduce a ‘hang’ in the application at will. Opening the document search panel and selecting a document both appeared to make the application freeze. These freezes lasted more than 20 seconds, and the customer was able to demonstrate the problem to the Advance7 team.
We also needed to find out why the application was 50% faster on the ‘Fat’ client (standard desktop PC) vs. the ‘Thin’ client (VDI).
Advance7 found evidence that window update / refresh messages were being stalled on the main application thread, and a decision was taken to capture some client-side data before proceeding to a full system data capture. Client-side delay could then be confirmed or ruled out quite easily.
The environment consisted of clients deployed via XenApp, application servers accessed via WCF NET.TCP, an MS SQL Server OLTP RDBMS and a vSAN for document storage.
Based on the demonstration of the problem, we started with a small client-side data capture and collected diagnostic data from the following sources:
- Network trace data covering all ingress / egress traffic for the user PC running the thin client
- Network trace data covering all ingress / egress traffic for the user PC running the fat client
- Multiple full process dumps taken during the thin client hang(s)
- Multiple full process dumps taken during the fat client hang(s)
The larger system capture plan involved collecting diagnostic information from the following sources:
- Client-side monitoring:
- Process dumps of the client application taken during the hang.
- Network data capture via Wireshark on the client PC.
- TCP data between the client and the application server (WCF NET.TCP).
- TDS data between the application server and the SQL AoA cluster.
- SMB data between the application server and the document storage vSAN.
However, this larger capture was not required, as the delay was found to be on the client side.
We captured and analysed multiple application operations and found multiple issues with the application. The most severe was that the (thin) VDI client was up to 50% slower than the PC client. This was due to excessive remote caching operations via SMB2. The SMB2 responses were not in themselves slow; the delay came purely from the volume of operations performed.
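The effect of operation volume can be illustrated with simple arithmetic. The sketch below is not derived from the capture: the operation counts and the per-operation round-trip time are hypothetical figures, and `total_wait_seconds` is a helper invented for this illustration. It assumes the operations run one after another, so each one must complete before the next is issued.

```python
def total_wait_seconds(operations: int, round_trip_ms: float) -> float:
    """Cumulative wait when operations are issued strictly one after another."""
    return operations * round_trip_ms / 1000.0


# Hypothetical figures for illustration only; not taken from the trace data.
ROUND_TRIP_MS = 2.0  # each individual response is fast

local_cache = total_wait_seconds(operations=200, round_trip_ms=ROUND_TRIP_MS)
remote_cache = total_wait_seconds(operations=5000, round_trip_ms=ROUND_TRIP_MS)

print(f"few operations:  {local_cache:.1f} s of cumulative waiting")
print(f"many operations: {remote_cache:.1f} s of cumulative waiting")
```

With the response time held constant, the cumulative wait grows in direct proportion to the number of operations, which is why reducing the ‘chatter’ matters more than speeding up any single response.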
Through stack analysis and heap investigation of process dumps taken during the application ‘hang’, we confirmed the early diagnosis: blocking operations were running on the process's STA thread in a single-threaded apartment model. This was responsible for the application hangs perceived by the user.
The easiest way to increase VDI client performance was to eliminate the remote caching operations (specifically the SMB2 ‘chatter’). By working with the application vendor, we identified configuration settings that could be adjusted to force the Thin client to use a local caching policy. This achieved a 50% speed increase, making Thin and Fat client performance comparable.
Investigation of the process dumps taken during the client ‘hang’ confirmed the early diagnosis: blocking operations were being performed on the STA thread in a single-threaded apartment model. This caused OS window refresh messages to queue, which is what the users interpreted as a ‘hang’. At no point was the application ever frozen or stuck in recursion. The vendor provided patches for the application that did not entirely remove this problem but lessened the impact.
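The queueing behaviour can be sketched with a small model of a single-threaded message loop. This is an illustrative simulation in virtual time, not the application's code: the `Message` type, the `dispatch_delays` helper, the message names and all timings are assumptions made for this example. It compares a handler that does slow work on the UI thread with one that returns quickly, as it would if the work were handed to a background worker.

```python
from dataclasses import dataclass


@dataclass
class Message:
    name: str
    posted_at: float     # seconds: when the message enters the queue
    handler_cost: float  # seconds the single UI thread spends handling it


def dispatch_delays(messages):
    """Handle messages in posting order on one thread; return each queueing delay."""
    clock = 0.0
    delays = {}
    for msg in sorted(messages, key=lambda m: m.posted_at):
        start = max(clock, msg.posted_at)      # cannot start until the thread is free
        delays[msg.name] = start - msg.posted_at
        clock = start + msg.handler_cost
    return delays


# Hypothetical timings: slow work performed directly on the UI thread.
blocking = [
    Message("search_click", 0.0, 20.0),
    Message("paint_1", 1.0, 0.01),
    Message("paint_2", 2.0, 0.01),
]
# Same messages, but the click handler returns quickly (work done elsewhere).
offloaded = [
    Message("search_click", 0.0, 0.05),
    Message("paint_1", 1.0, 0.01),
    Message("paint_2", 2.0, 0.01),
]

blocking_delays = dispatch_delays(blocking)
offloaded_delays = dispatch_delays(offloaded)
print("blocking: ", blocking_delays)
print("offloaded:", offloaded_delays)
```

In the blocking case the paint messages sit in the queue until the slow handler returns, so the window does not redraw even though the thread is still doing useful work. In the offloaded case they are handled as soon as they arrive.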