Telehouse Platform Service Interruption
Incident Report for Gradwell Communications Ltd
Postmortem

What happened?

At 20:03 Gradwell's networking supplier performed maintenance on one of Gradwell's main networking connections. This caused a brief loss of connection and triggered a reroute of traffic away from the affected parts of the network. Unfortunately, the replacement route that the network tried to take was misconfigured, and traffic was routed nowhere. This meant that no inbound traffic to the Gradwell network was able to connect.
The original fault was resolved at 20:04 and the connection was able to receive traffic again. However, the network did not automatically send traffic back to the original connection.

Gradwell engineers enacted our disaster recovery plan at 20:23. This involved updating the DNS of our core platform to route to our cloud-native platform, away from the network errors. Hosted VoIP and SIP trunking services were restored by this recovery work at 20:52. Further remedial work to enable inbound IAX traffic was completed at 21:44.
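For illustration only, the sketch below shows the general shape of a DNS-based failover like the one described above, assuming a simple HTTP health check. The hostname, health-check URL, IP address and update_dns_record helper are hypothetical placeholders rather than Gradwell tooling; a real implementation would call the DNS provider's own API.

import requests

# Hypothetical names for illustration; not Gradwell's real endpoints or tooling.
PRIMARY_HEALTH_URL = "https://primary.example.net/health"  # Telehouse-hosted platform
CLOUD_PLATFORM_IP = "203.0.113.10"                         # documentation-range placeholder
SERVICE_HOSTNAME = "voice.example.net"

def primary_is_healthy(timeout: float = 5.0) -> bool:
    # True if the primary platform answers its health check in time.
    try:
        return requests.get(PRIMARY_HEALTH_URL, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def update_dns_record(hostname: str, ip_address: str) -> None:
    # Placeholder: a real system would call the DNS provider's API here.
    print(f"Would repoint {hostname} A record to {ip_address}")

if not primary_is_healthy():
    # Fail over by repointing DNS at the cloud-native platform.
    update_dns_record(SERVICE_HOSTNAME, CLOUD_PLATFORM_IP)

In practice the time-to-live (TTL) on the DNS record bounds how quickly clients follow the change, which is one reason a DNS-based failover takes minutes rather than seconds to take full effect.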

Networking engineers were engaged at 22:14, and the null routes were identified as the cause of the network fault. The engineers manually removed the misconfigured routing, resolving the network fault at 23:37.

The null route that caused traffic not to reach Gradwell was put in place following Gradwell's disaster recovery testing in July 2019. The null routes were part of a suite of changes put in place to improve resilience for Gradwell connectivity customers. The remainder of the work, to ensure that the null routes were not prioritised over other available routes, was not completed.

If this work had been completed, Gradwell services would have been disrupted from 20:03 to 20:04 only.
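As an illustration of the prioritisation gap described above, a null route only works as a safe fallback if every valid route is preferred ahead of it; if it carries a better preference, it keeps winning even after the original connection recovers. The sketch below uses made-up preference values, not the supplier's actual configuration.

# Illustrative only: toy route selection with made-up preference values.
# Lower preference wins, loosely mirroring administrative distance.
routes = [
    {"next_hop": "null0",        "preference": 5,  "up": True},  # blackhole route
    {"next_hop": "primary-link", "preference": 10, "up": True},  # restored at 20:04
]

def best_route(table):
    # Pick the lowest-preference route whose link is up.
    candidates = [r for r in table if r["up"]]
    return min(candidates, key=lambda r: r["preference"])

print(best_route(routes)["next_hop"])  # "null0": traffic still goes nowhere
# With the remedial work done (null route deprioritised), the primary wins again:
routes[0]["preference"] = 255
print(best_route(routes)["next_hop"])  # "primary-link"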

We are working closely with our network supplier to help them identify why the routing work was not completed and improve their processes to make sure this does not happen again.

What happens next?

Here at Gradwell we are continuing our focus on service stability. We understand how important it is that customers can rely on us and would like to apologise for the disruption caused by this incident.

We are working to make our recovery processes smoother and to automate failover to our cloud-native platform. This incident showed that our disaster recovery plan was able to restore service to core products, but that the time taken to do so was too long. It has allowed us to identify further improvements, which we are working to deliver this week.
This incident highlighted Gradwell's reliance on our networking supplier, and the impact that changes made to that network can have. We are working closely with our supplier to improve the service they provide and to improve the monitoring and maintenance of the network.

Once again, we apologise for the disruption caused by this incident. We are committed to providing a best-in-class service for our customers and can assure you we are working hard to deliver this.

Posted Oct 08, 2019 - 10:51 BST

Resolved
This incident has been resolved. Thank you for your patience.
Posted Oct 03, 2019 - 06:54 BST
Update
All services are now fully restored and running from the primary Telehouse datacentre. We will continue to monitor overnight, and will close this issue tomorrow morning once service remains stable.
Posted Oct 03, 2019 - 00:11 BST
Monitoring
We have resolved the networking issue, and our primary Telehouse datacentre is now back online. We will work to ensure services are restored before we revert our Disaster Recovery plan.
Posted Oct 02, 2019 - 23:42 BST
Update
We are continuing to work on resolving this network incident. We believe we have found the cause of the problem and are working to restore the Telehouse datacentre over the next few hours.
Posted Oct 02, 2019 - 23:33 BST
Update
We have now enabled inbound IAX calls via our AWS platform, so customers using inbound IAX trunking should see calls connect. Please note they will come from different IP address ranges. Please see: https://support.gradwell.com/hc/en-gb/articles/215553563-What-IP-addresses-may-Gradwell-send-VoIP-traffic-from-
Posted Oct 02, 2019 - 21:51 BST
Update
Service for Single/Multi User VoIP is being maintained via AWS, as well as inbound and outbound SIP trunks. However, our initial attempts at service restoration in Telehouse have not worked, so that work is ongoing.
Posted Oct 02, 2019 - 21:37 BST
Identified
We have identified that this is a problem with the Telehouse networking, and not power, and we are continuing to work on resolving the problem promptly. In the meantime, service is being restored via our AWS platform.
Posted Oct 02, 2019 - 20:49 BST
Update
Whilst we work to identify the problem, we have triggered our failover plan to route traffic into our AWS platform, so customers will see partial service delivery.
Posted Oct 02, 2019 - 20:28 BST
Investigating
We have confirmed a problem in our Telehouse datacentre: all Telehouse-based VoIP services are currently offline. It looks to be a power-related issue, but we are currently investigating.
Posted Oct 02, 2019 - 20:18 BST
This incident affected: Voice & Calls Services (Multi User VoIP, Outbound SIP Trunking, Outbound IAX Trunking, Inbound SIP trunking, Inbound IAX Trunking, Email2Fax, Fax2Email, Gradwell Mobile, Single User VoIP, Hosted 3CX Server).