Our customer has a broking platform supporting hundreds of internal users.
The system submits messages to a Gateway server for processing, and was expected to handle tens of submissions per second.
In practice, processing would slow to 1-2 submissions per second, causing the application servers that were sending the submissions to log errors.
Although some areas for improvement had been identified, the internal IT team were still reporting logged errors and processing speeds below the expected level.
The environment consisted of 5 application servers and a batch processing server as below:
Analysis of the network traffic determined the primary protocols used by the Gateway server were as follows:
- SMB2 (Server Message Block Version 2)
- TDS (Tabular Data Stream, used for Database communications)
- HTTP (Inbound server to server communications)
The components of this system were all located within the same data centre.
The platform was built on Windows Server 2012, IIS, Microsoft SQL Server 2012 and .NET, and was virtualised on a Cisco UCS platform using VMware ESX, with an Isilon SAN providing network shares.
Network traffic was captured at the Gateway server, IIS server and database servers.
Performance statistics were also captured using Windows PerfMon, including .NET performance counters for the application.
The customer provided application and Windows event logs, which allowed a precise determination of when the errors occurred.
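As a rough illustration of how error timestamps can be lined up against throughput, the sketch below counts submissions per second and keeps only the errors that fall in seconds where throughput dropped below a threshold. The function names, the use of epoch-second floats and the threshold of 10 are assumptions made for this example, not details of the customer's logs.

```python
from collections import Counter


def submissions_per_second(timestamps):
    """Count submissions in each whole second (timestamps are epoch seconds)."""
    return Counter(int(t) for t in timestamps)


def errors_during_slowdown(submission_times, error_times, threshold=10):
    """Return the error timestamps that fall in a second with fewer than
    `threshold` submissions. The threshold is an assumed value."""
    per_second = submissions_per_second(submission_times)
    return [e for e in error_times if per_second.get(int(e), 0) < threshold]


if __name__ == "__main__":
    # Made-up data: 12 submissions in second 0, only 2 in second 1.
    submissions = [i / 12 for i in range(12)] + [1.1, 1.6]
    errors = [0.5, 1.5]
    print(errors_during_slowdown(submissions, errors))
```

With real data, the two lists would come from the application log and the Windows event log respectively, after converting both to a common clock.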
Analysis of the network transactions quickly showed that there were no significant delays in either the HTTP or the SMB2 traffic.
However, as the chart below indicates, there were significant delays present in the database transactions that use the TDS protocol.
Specifically, the delay was within the database server itself, and not within the network.
The ‘banding’ in the chart shows that blocking was occurring within the database server.
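One way to separate server time from network time in a capture is to subtract an estimate of the network round trip from the gap between a request and the first packet of its response. The sketch below does this and then groups the results into buckets, so that clusters of similar delays (the kind of banding described above) show up as heavily populated buckets. The `TdsExchange` record, its field names and the 0.5 second bucket width are assumptions for illustration; they are not taken from the tooling used in this analysis.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TdsExchange:
    request_sent: float    # seconds: last request packet seen at the capture point
    first_response: float  # seconds: first response packet seen at the capture point
    network_rtt: float     # seconds: round trip estimate, e.g. from the TCP handshake


def server_time(ex):
    """Approximate time spent inside the database server for one exchange."""
    return max(0.0, (ex.first_response - ex.request_sent) - ex.network_rtt)


def band_histogram(exchanges, bucket=0.5):
    """Count exchanges per bucket of server time; keys are bucket lower bounds."""
    counts = Counter()
    for ex in exchanges:
        counts[int(server_time(ex) // bucket) * bucket] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up exchanges: one fast query and two that waited about two seconds.
    sample = [
        TdsExchange(0.0, 0.011, 0.001),
        TdsExchange(1.0, 3.051, 0.001),
        TdsExchange(2.0, 4.101, 0.001),
    ]
    print(band_histogram(sample))
```

If the round trip is small and stable while the computed server time is large, the delay is attributable to the server rather than the network.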
In addition, the PerfMon data showed that:
- The IIS worker processes spent a large percentage of time running Garbage Collections.
- IIS application pools ran under a mix of .NET v2.0 and .NET v4.0.
- The .NET applications were 32-bit and not 64-bit.
- Disk write latency on the Gateway server was high.
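The garbage collection finding can be checked from a PerfMon log exported to CSV. The sketch below averages every column whose counter path ends in `\% Time in GC` (a counter in the `.NET CLR Memory` category). The host name `GATEWAY`, the sample values and the function name are made up for this example, and blank samples are skipped on the assumption that they represent missing data.

```python
import csv
import io


def mean_time_in_gc(csv_text):
    """Return {counter path: mean value} for each '% Time in GC' column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = [i for i, name in enumerate(header) if name.endswith(r"\% Time in GC")]
    totals = {i: 0.0 for i in columns}
    counts = {i: 0 for i in columns}
    for row in reader:
        for i in columns:
            value = row[i].strip() if i < len(row) else ""
            if value:  # skip blank samples
                totals[i] += float(value)
                counts[i] += 1
    return {header[i]: totals[i] / counts[i] for i in columns if counts[i]}


# Made-up sample in the layout of a PerfMon CSV export (hypothetical host name).
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0) (GMT Standard Time)(0)","\\GATEWAY\.NET CLR Memory(w3wp)\% Time in GC","\\GATEWAY\Processor(_Total)\% Processor Time"',
    r'"01/01/2020 10:00:00.000","40.0","12.5"',
    r'"01/01/2020 10:00:15.000","60.0","15.0"',
    r'"01/01/2020 10:00:30.000"," ","11.0"',
])

if __name__ == "__main__":
    for counter, mean in mean_time_in_gc(SAMPLE).items():
        print(counter, round(mean, 1))
```

A sustained high average for a worker process suggests memory pressure in that process, which is consistent with the 32-bit finding listed above.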
Message processing throughput decreased due to poor database performance:
- Details of the long running SQL statements causing blocking were shared with the application development team and database administrators.
- With this information, the database administrators tuned indexes and query plans to reduce the processing time of the long running queries.
- The application developers migrated SQL statements to stored procedures with less aggressive locks to reduce blocking.
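To illustrate how blocking statements can be traced back to their source, the sketch below takes a snapshot of (session, blocking session) pairs and counts how many sessions are waiting, directly or indirectly, behind each head blocker. It assumes rows shaped like the `session_id` and `blocking_session_id` columns of SQL Server's `sys.dm_exec_requests` view, with 0 meaning "not blocked"; the function name and sample data are hypothetical, and this is not necessarily how the blocking queries were identified in this engagement.

```python
def head_blockers(requests):
    """Map each head blocker to the number of sessions queued behind it.

    `requests` is an iterable of (session_id, blocking_session_id) pairs,
    where a blocking_session_id of 0 means the session is not blocked.
    """
    blocked_by = {sid: blocker for sid, blocker in requests if blocker}
    counts = {}
    for sid in blocked_by:
        seen = set()
        current = sid
        # Walk up the chain until we reach a session that is not itself blocked.
        while current in blocked_by and current not in seen:
            seen.add(current)
            current = blocked_by[current]
        if current not in blocked_by:  # ignore cycles, which indicate a deadlock
            counts[current] = counts.get(current, 0) + 1
    return counts


if __name__ == "__main__":
    # Made-up snapshot: session 51 blocks 52 and 54, and 52 in turn blocks 53.
    snapshot = [(51, 0), (52, 51), (53, 52), (54, 51), (60, 0)]
    print(head_blockers(snapshot))
```

The statement running in the head blocker session is usually the first candidate for tuning, since releasing it unblocks the whole chain.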
As well as identifying the root cause of the problem, we shared other recommendations with the respective technology owners that would deliver service improvement:
- Determine the database performance impact of running scheduled reports at the same time as serving end user queries.
- Consider migrating from 32-bit to 64-bit .NET application pools to reduce the rate of garbage collections.
- Investigate why the disk write latency on the Gateway server was higher than on other components.