Inbound & Outbound call failures

Incident Report for Gradwell Communications Ltd

Postmortem

Please find below an abridged version of the Reason for Outage (RFO) for the Inbound & Outbound call failures outage on 25/08/2020.

Here at Gradwell, we are continuing our focus on service stability. We understand how important it is that customers can rely on us and would like to apologise for the disruption caused by this incident.

A full copy of the RFO is available on request from support@gradwell.com.

Description of outage and impact:

Gradwell’s voice monitoring platform recorded sequential and repeated call failures across one of our Enterprise-grade voice location platforms in London, causing customers to experience inbound and outbound call failures for the duration of the incident. This was the result of a Major Service Outage (MSO) at the Enterprise-grade supplier’s London location. Gradwell declared an MSO at this point and appointed its CTO as Senior Problem Manager to take control of the incident.

This was treated as a priority 1 (P1) outage and lasted from 1036hrs BST to 1205hrs BST on 25 August 2020 (1 hour 29 minutes).

Cause & Resolution:

Gradwell Engineers were fully engaged on the incident and aware that a failover had been made to our alternate Data Centre location in London. Although that location was active and ready to take calls, they noticed that the internal failover within the Enterprise-grade supplier was also affected, despite only one of the two affected zones having been reported. Gradwell’s Engineers attempted to restore the failover to the alternate zone in order to bring the alternate services, which were also being affected, back into service. This was achieved by removing the reliance on key features that depend on the availability of the Enterprise-grade supplier’s zones in London. However, the supplier’s outage was more significant than first reported, with both the alternate and the diversely separated zones affected. Test calls into a single zone confirmed that, at best, only 1 in 40 calls were successful, so further action was required to restore all customer calls. This action was carried out and service was restored.
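To put the test-call figure in context, the following is a minimal sketch of how a per-zone success rate could be derived from test calls and compared against a health threshold. The function names and the 95% threshold are assumptions made for illustration only; they do not describe Gradwell's actual monitoring tooling.

```python
def success_rate(successful: int, attempted: int) -> float:
    """Return the fraction of test calls that connected."""
    if attempted <= 0:
        raise ValueError("attempted must be a positive number of calls")
    return successful / attempted


def zone_is_healthy(successful: int, attempted: int, threshold: float = 0.95) -> bool:
    """Treat a zone as healthy only if its success rate meets the threshold.

    The 0.95 default is a hypothetical value chosen for this example.
    """
    return success_rate(successful, attempted) >= threshold


# The figure quoted in this report: at best 1 successful call in 40.
rate = success_rate(1, 40)
print(f"{rate:.1%}")           # 2.5%
print(zone_is_healthy(1, 40))  # False
```

At a 2.5% success rate the zone falls far below any reasonable threshold, which is consistent with the decision to move services away from it.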

Working in tandem with the supplier to restore services, we were able to remove the reliance on this platform in London. Gradwell’s Engineering team created a new, geographically dispersed instance so that the servers handling pre-pay call authentication (non-sensitive data) could operate in a different region. This was done to take services away from London and isolate the region. The arrangement remained in place until the supplier had completely restored its core region in London and Gradwell was confident it could be brought back into service.

At 1205hrs, all Gradwell calls were restored and customer calls returned to normal connection and success rates.

Prevention of recurrence:

Gradwell has now removed the reliance on only two diverse zones for all call Pre-Pay authentication and is adding a third diverse zone, as well as a separate, geographically dispersed region, allowing four zones in total for operational use.

This risk has also been mitigated since 25 Aug 20 by the creation of an alternate geographical zone, which also handles call authentication, shares load and is able to take over all Pre-Pay authentication should a further failure occur.
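The four-zone arrangement described above can be illustrated with a minimal sketch of load sharing and takeover across zones. The zone names, region labels and health flags are hypothetical and are not Gradwell's configuration. The example simulates every London zone failing, leaving the zone in the other region to handle all Pre-Pay authentication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str      # hypothetical label, for illustration only
    region: str    # hypothetical region label
    healthy: bool  # result of a health check, e.g. test calls


def healthy_zones(zones: list[Zone]) -> list[Zone]:
    """Return the zones that can currently share authentication load."""
    return [zone for zone in zones if zone.healthy]


def can_authenticate(zones: list[Zone]) -> bool:
    """Authentication remains available while at least one zone is healthy."""
    return any(zone.healthy for zone in zones)


# Simulated regional failure: all three London zones are down.
zones = [
    Zone("london-a", "london", healthy=False),
    Zone("london-b", "london", healthy=False),
    Zone("london-c", "london", healthy=False),
    Zone("remote-a", "other-region", healthy=True),
]

print([zone.name for zone in healthy_zones(zones)])  # ['remote-a']
print(can_authenticate(zones))                       # True
```

With only zones in a single region, the same simulated failure would leave no healthy zone at all, which is the situation the additional region is intended to prevent.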

Out-of-hours failover tests will be increased and undertaken each quarter to test and prove successful failover between geographical locations and regions, including the new diverse geographical location.

Gradwell apologises for the loss of services and the impact it has had on your business, and has now expedited a permanent remedy.

Posted Sep 02, 2020 - 18:23 BST

Resolved

Our teams have continued to monitor all platforms overnight, and the fix put in place at 1205hrs Tue 25 Aug 20 remains fully operational, with no further effect on service delivery. All recovery works by AWS were completed as of 0630hrs this morning and a Return To Normal Operations (RTNO) has been declared.

Gradwell apologises for the loss of service to our customers. A Reason For Outage (RFO) will be prepared in consultation with AWS and released in due course.

Posted Aug 26, 2020 - 09:47 BST

Monitoring

Gradwell has fully restored all operations with effect from 12:05hrs today. We will continue to monitor the situation overnight, with our engineering teams supporting throughout.

Once we are content that the issue has been fully resolved and will not recur, we will issue a Return To Normal Operations (RTNO).

If you are having issues with your control panel, we believe this to be a DNS issue. Please reboot your computer and retest.

We will follow up at 10 am on Wednesday 26/08/2020.

Please accept our apologies for the inconvenience caused.

Posted Aug 25, 2020 - 17:25 BST

Update

We continue to declare a partial Return To Normal Operations (RTNO) while we test, prove and ensure that stability is maintained throughout the network and the systems we support you with.

If you are having issues with your control panel, we believe this to be a DNS issue. Please reboot your computer and retest.

We continue to monitor this issue and work to restore the services to their full capacity.

Please accept our apologies for the inconvenience caused.

Posted Aug 25, 2020 - 14:50 BST

Update

We have seen call flow return as expected across all elements of the network. We continue to declare a partial Return To Normal Operations (RTNO) while we test, prove and ensure that stability is maintained throughout the network and the systems we support you with.

Please accept our apologies for the inconvenience caused.

Our next update will be at 3pm.

Posted Aug 25, 2020 - 13:12 BST

Update

We are starting to see a partial return to service and calls are connecting as expected.
We will confirm that we are back to normal operations as soon as possible.

Please accept our apologies for the inconvenience caused.

Posted Aug 25, 2020 - 12:27 BST

Update

As a result of a major failure within AWS in London affecting UK customers, we are working with AWS to restore services and are migrating services to alternate locations while restoration is under way. Gradwell has all available engineering resource assigned to restoring service as soon as possible.

We apologise for the inconvenience caused.

Posted Aug 25, 2020 - 12:08 BST

Identified

We have identified a major outage within our London exchange. We are continuing to treat this as a P1 outage and are working with our upstream carriers to resolve this issue.
We sincerely apologise for the inconvenience caused.

Posted Aug 25, 2020 - 11:32 BST

Update

We are still investigating the root cause of this issue. We are treating this as a P1 outage.

We apologise for the inconvenience caused.

Posted Aug 25, 2020 - 11:21 BST

Update

Our support numbers are also currently affected. If you require assistance, please email support@gradwell.com.
We are continuing to investigate this issue.

We apologise for the inconvenience caused.

Posted Aug 25, 2020 - 10:51 BST

Investigating

We are currently investigating an issue with outbound call failures.

We will have an update within the next 30 minutes.

Posted Aug 25, 2020 - 10:47 BST

This incident affected: Voice & Calls Services (Multi User VoIP, Outbound SIP Trunking, Outbound IAX Trunking, Inbound SIP trunking, Inbound IAX Trunking, Single User VoIP).