Database Replication Problem Stalls Cloud Project at UK Retail Bank


A major program to deliver solutions using a cloud and on-premises model was put on hold for several weeks because of a database replication problem. Although several changes had been made in an attempt to circumvent the problem, the database still could not complete replication between the cloud and the customer’s own data centre.


Using our standard structured approach, we devised a Diagnostic Capture Plan. This allowed us to capture and match encrypted and plain-text network traffic at different points in the network path during the database replication process. Because the traffic was encrypted, we could identify the device causing the problem only by matching the captures packet for packet where required.
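
As an illustration only, the idea of matching packet for packet between two capture points can be sketched in a few lines of Python. The tooling actually used on this engagement is not described here, so the names below (Packet, match_captures) are hypothetical. The sketch also assumes that TCP sequence numbers and payload lengths are preserved between the two capture points, which may not hold where a device proxies or re-encrypts the connection.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Packet:
    """Minimal per-packet record (hypothetical; real captures carry far more)."""
    timestamp: float  # capture time in seconds
    seq: int          # TCP sequence number
    length: int       # TCP payload length in bytes


def match_captures(
    upstream: List[Packet], downstream: List[Packet]
) -> List[Tuple[Packet, Optional[Packet]]]:
    """Pair each upstream packet with the earliest unused downstream packet
    that has the same (seq, length). Unmatched packets are paired with None."""
    # Index downstream packets by (seq, length), keeping capture order.
    pending: Dict[Tuple[int, int], List[Packet]] = {}
    for pkt in downstream:
        pending.setdefault((pkt.seq, pkt.length), []).append(pkt)

    matched: List[Tuple[Packet, Optional[Packet]]] = []
    for pkt in upstream:
        candidates = pending.get((pkt.seq, pkt.length))
        partner = candidates.pop(0) if candidates else None
        matched.append((pkt, partner))
    return matched


if __name__ == "__main__":
    # Made-up sample data: the second packet never appears downstream.
    up = [Packet(0.00, 1000, 500), Packet(0.10, 1500, 500), Packet(0.20, 2000, 500)]
    down = [Packet(0.05, 1000, 500), Packet(0.25, 2000, 500)]
    for a, b in match_captures(up, down):
        status = "missing downstream" if b is None else f"delay {b.timestamp - a.timestamp:.3f}s"
        print(a.seq, status)
```

Packets that appear at one point but not the other, or that show a large delay between the two points, narrow down which device in the path is altering or holding the traffic.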


By analysing the data, we saw that the replication was slowed to a near halt by incorrect TCP behaviour in a firewall in the data centre, which caused the sending database server to transmit very slowly for extended periods. In order to provide an application-layer gateway (ALG) service, the firewall was proxying some but not all aspects of the connection. This led to an inconsistent view of the TCP connection state being presented to the database server. As a result, the database replication process required many hours rather than a few minutes to complete.


Once we had presented our findings to the customer, the firewall team was able to work around the problem by disabling the ALG service for the traffic. This allowed a well-behaved TCP connection between the two servers that performed the data transfer and the database replication, and the process then completed in just a few minutes. The blockage on the critical path of the entire program was removed.


