Our customer runs a financial management system used by hundreds of users across the globe. Users reported being disconnected from the application while working in it. Whenever a user tried to recover from a disconnect and resume work, their application account was locked out. The problem occurred at least daily and often affected several users at the same time.
This application uses a conventional 3-tier design:
- Web Servers (Oracle HTTP Server)
- Java Application Servers (Oracle WebLogic)
- Database (Oracle)
The Web and Application tiers are spread across multiple data centres in the same geographical area.
We captured network traffic on every Web and Application node and correlated it with the application logs. For each instance of the problem, this let us determine when it occurred, which user was affected, and which Web and Application nodes were serving that user's session at the time.
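The correlation step can be illustrated with a minimal sketch. It assumes the log entries and captured network events have already been parsed into simple records. The record types, field names, and the two-second matching window are hypothetical choices for illustration, not the tooling used in the investigation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEvent:
    """A disconnect recorded in the application logs (hypothetical shape)."""
    timestamp: datetime
    user: str
    web_node: str
    app_node: str


@dataclass(frozen=True)
class PacketEvent:
    """A notable event extracted from a network capture (hypothetical shape)."""
    timestamp: datetime
    web_node: str
    app_node: str
    description: str


def correlate(log_events, packet_events, window=timedelta(seconds=2)):
    """Pair each log event with the network events seen between the same
    Web and Application nodes within +/- `window` of the log timestamp."""
    matches = []
    for log in log_events:
        nearby = [
            pkt for pkt in packet_events
            if pkt.web_node == log.web_node
            and pkt.app_node == log.app_node
            and abs(pkt.timestamp - log.timestamp) <= window
        ]
        matches.append((log, nearby))
    return matches
```

Matching on both the node pair and a time window is what ties a user-visible disconnect to the network behaviour on the specific nodes serving that session. In practice the window has to allow for clock differences between nodes.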
From analysis of the network and log data, we proved that the problem occurs when a Web node is experiencing TCP port exhaustion at the moment a user request arrives. If the exhaustion lasts longer than one second, it triggers a fail-over module in the Web node, which redirects communications intended for a specific Application node to other Application nodes.
This redirection interrupts the user's session with its original Application node, and the user experiences a disconnect event.
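The fail-over behaviour described above can be modelled with a short sketch. This is a simplified illustration of the rule as reported in the findings, not the actual logic of the Oracle HTTP Server fail-over module. The function name, its parameters, and the choice of the first alternative node are assumptions.

```python
FAILOVER_THRESHOLD_SECONDS = 1.0  # from the findings: exhaustion lasting longer than one second


def choose_app_node(preferred, app_nodes, exhausted_for_seconds,
                    threshold=FAILOVER_THRESHOLD_SECONDS):
    """Return the Application node a request is sent to under the modelled rule.

    `preferred` is the node that holds the user's session. If the Web node has
    been out of TCP ports for longer than `threshold`, the request is
    redirected to another node (here simply the first alternative; an assumption).
    """
    if exhausted_for_seconds <= threshold:
        return preferred
    alternatives = [node for node in app_nodes if node != preferred]
    return alternatives[0] if alternatives else preferred


def session_interrupted(preferred, chosen):
    """The user sees a disconnect when the request leaves the session's node."""
    return chosen != preferred
```

For example, with nodes `["app1", "app2"]` and a session on `app1`, an exhaustion lasting 0.5 seconds leaves the request on `app1`, while one lasting 1.5 seconds sends it to `app2`, which has no knowledge of the user's session.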
We presented the evidence to our customer in a findings report showing the detailed diagnostic data used to analyse the problem.
With the root cause of the problem proven, several solutions were possible:
- Add more Web servers
- Reduce the number of applications/ports
- Reconfigure the Web servers to support more TCP sessions
Our customer chose to increase the amount of memory available on the Web servers so that they could support more concurrent TCP sessions.