Please find below an abridged version of the Reason for Outage (RFO) report for the inbound and outbound call failures on 25/08/2020.
Here at Gradwell, we are continuing our focus on service stability. We understand how important it is that customers can rely on us and would like to apologise for the disruption caused by this incident.
A full copy of the RFO is available upon request from support@gradwell.com.
Gradwell’s voice monitoring platform recorded sequential and repeated call failures across one of our Enterprise-grade voice platform locations in London, causing customers to experience inbound and outbound call failures. This was the result of a Major Service Outage (MSO) at the Enterprise-grade supplier’s location in London. An MSO was declared, and Gradwell’s CTO was appointed as Senior Problem Manager and took control of the incident.
This was treated as a Priority 1 outage and lasted from 1036hrs BST to 1205hrs BST on 25 August 2020 (1 hour 29 minutes).
Gradwell’s Engineers were fully engaged on the incident and aware that failover had been made to our alternate Data Centre location in London. Although that location was active and ready to take calls, they noticed that the internal failover within the Enterprise-grade supplier was also affected, even though only one of the two affected zones had been reported. Gradwell’s Engineers attempted to restore failover to the alternate zone in order to bring the alternate services, which were also affected, back into service. This was achieved by removing the reliance on key features tied to the availability of the Enterprise-grade supplier’s zones in London. However, the outage with our supplier was more significant than first reported, with both the alternate and the diversely separated zones affected. Test calls into a single zone confirmed that, at best, only 1 in 40 calls was successful, so further action was required to restore all customer calls. This action was taken and calls were restored.
Working in tandem with the supplier to restore services, we were able to remove the reliance on this platform in London. Gradwell’s Engineering team created a new geographically dispersed instance so that the servers could operate within a different region for Pre-Pay call authentication, which involves only non-sensitive data. This was done to move services away from London and isolate the region. The arrangement remained in place until the supplier had completely restored its core region in London and Gradwell was confident it could be brought back into service.
At 1205hrs, all Gradwell calls were restored and customer calls returned to normal connection and success rates.
Gradwell has now removed the reliance on only two diverse zones for all Pre-Pay call authentication and is adding a third diverse zone, as well as a separate, geographically dispersed region, giving four zones in total for operational use.
This risk has also been mitigated since 25 August 2020 by the creation of an alternate geographical zone, which also handles call authentication, shares load, and is able to take over all Pre-Pay authentication should a further failure occur.
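For readers who would like a more concrete picture of load sharing with failover across several zones, the following is a minimal illustrative sketch only. It is not Gradwell's implementation: the zone names, the `Zone` and `AuthRouter` classes and the `ZoneUnavailable` error are all hypothetical, and it assumes each zone exposes a simple function that either authorises a call or reports that the zone is unavailable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class ZoneUnavailable(Exception):
    """Raised when a zone cannot serve an authentication request."""


@dataclass
class Zone:
    name: str                            # hypothetical label, not a real zone name
    authenticate: Callable[[str], bool]  # True if the call is authorised


class AuthRouter:
    """Shares requests across zones in turn and fails over on errors."""

    def __init__(self, zones: List[Zone]) -> None:
        if not zones:
            raise ValueError("at least one zone is required")
        self._zones = zones
        self._start = 0  # index of the zone that gets the next request first

    def authenticate(self, account_id: str) -> Tuple[str, bool]:
        count = len(self._zones)
        first = self._start
        # Rotate the starting zone so that load is shared between zones.
        self._start = (self._start + 1) % count
        for offset in range(count):
            zone = self._zones[(first + offset) % count]
            try:
                return zone.name, zone.authenticate(account_id)
            except ZoneUnavailable:
                continue  # fail over to the next zone in the rotation
        raise ZoneUnavailable("no zone could authenticate the call")


def healthy(account_id: str) -> bool:
    return True


def down(account_id: str) -> bool:
    raise ZoneUnavailable("zone is not responding")


# Two London zones failing, a third zone and a remote region still healthy.
router = AuthRouter([
    Zone("london-a", down),
    Zone("london-b", down),
    Zone("london-c", healthy),
    Zone("remote-region", healthy),
])
```

In this sketch a request is only refused when every zone is unavailable, which is the property the additional zone and region are intended to provide.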
Additional out-of-hours failover tests will be undertaken each quarter to prove successful failover between geographical locations and regions, including the new diverse geographical location.
Gradwell apologises for the loss of service and the impact it has had on your business, and has now expedited a permanent remedy.