At 20:03 Gradwell's networking supplier performed maintenance on one of Gradwell's main networking connections. This caused a brief loss of connection and triggered a reroute of traffic away from the affected parts of the network. Unfortunately, the replacement route that the network selected was misconfigured and traffic was sent into a null route, effectively discarding it. As a result, no inbound traffic could reach the Gradwell network.
The original fault was resolved at 20:04 and the connection was able to receive traffic again. However, traffic did not automatically fail back to the original connection.
Gradwell engineers invoked our disaster recovery plan at 20:23. This involved updating the DNS for our core platform to route traffic to our Cloud-native platform, away from the network errors. Hosted VoIP and SIP trunking services were restored by this recovery work at 20:52. Further remedial work to enable inbound IAX traffic was completed at 21:44.
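For illustration, the DNS cut-over step in a plan like this can be scripted. The sketch below uses the dnspython library to repoint a hypothetical core-platform record at the Cloud-native platform; the zone, record name, addresses and key are placeholders, not Gradwell's actual configuration.

```python
import dns.query
import dns.rcode
import dns.tsigkeyring
import dns.update

# Hypothetical names, addresses and key for illustration only.
ZONE = "example-voip-platform.net."
RECORD = "core"                        # core.example-voip-platform.net
CLOUD_PLATFORM_IP = "203.0.113.10"     # address of the Cloud-native platform
PRIMARY_NAMESERVER = "192.0.2.53"
KEYRING = dns.tsigkeyring.from_text({"failover-key.": "cGxhY2Vob2xkZXItc2VjcmV0"})

# Build a dynamic update that repoints the core platform record at the
# Cloud-native platform, with a short TTL so the change takes effect quickly.
update = dns.update.Update(ZONE, keyring=KEYRING)
update.replace(RECORD, 60, "A", CLOUD_PLATFORM_IP)

response = dns.query.tcp(update, PRIMARY_NAMESERVER, timeout=10)
print("DNS failover update returned:", dns.rcode.to_text(response.rcode()))
```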
Networking engineers were engaged at 22:14 and identified the null routes as the cause of the network fault. The engineers manually removed the misconfigured routing, resolving the network fault at 23:37.
The null routes that prevented traffic from reaching Gradwell were put in place following Gradwell's disaster recovery testing in July 2019, as part of a suite of changes intended to improve resilience for Gradwell connectivity customers. The remainder of that work, ensuring that the null routes were never prioritised over other available routes, was not completed.
If this work had been completed, Gradwell services would have been disrupted from 20:03 to 20:04 only.
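To illustrate the routing behaviour at fault (a conceptual sketch only, not the supplier's configuration): a null route should carry the least attractive preference, so it is only selected when no real path exists. If it competes on equal or better terms, a brief withdrawal of the primary path leaves traffic blackholed even after the primary returns.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Route:
    destination: str
    next_hop: Optional[str]  # None models a null (blackhole) route
    preference: int          # lower value = more preferred, as with administrative distance

def best_route(routes: List[Route], destination: str) -> Optional[Route]:
    """Return the most preferred route for a destination, or None if there is no route."""
    candidates = [r for r in routes if r.destination == destination]
    return min(candidates, key=lambda r: r.preference, default=None)

# Intended design: the null route is heavily de-preferred, so it only carries
# traffic when every real path has been withdrawn.
correct = [
    Route("198.51.100.0/24", "primary-connection", preference=10),
    Route("198.51.100.0/24", None, preference=255),
]
assert best_route(correct, "198.51.100.0/24").next_hop == "primary-connection"

# The incomplete configuration: the null route competes at a better preference,
# so even with the primary connection restored, traffic keeps being discarded.
incomplete = [
    Route("198.51.100.0/24", "primary-connection", preference=10),
    Route("198.51.100.0/24", None, preference=5),
]
assert best_route(incomplete, "198.51.100.0/24").next_hop is None  # blackholed
```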
We are working closely with our network supplier to help them identify why the routing work was not completed and improve their processes to make sure this does not happen again.
Here at Gradwell we are continuing our focus on service stability. We understand how important it is that customers can rely on us and would like to apologise for the disruption caused by this incident.
We are working to make our recovery processes smoother and to automate failover to our Cloud-native platform. This incident showed that the disaster recovery plan was able to restore service for our core products, but that the time taken to do so was too long. It has allowed us to identify further improvements, which we are working to deliver this week.
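As a sketch of what that automation could look like (hypothetical endpoint, thresholds and failover hook, rather than our production tooling): a monitor polls the core platform's health endpoint and triggers the DNS cut-over once a number of consecutive checks fail.

```python
import time
import requests

# Hypothetical health endpoint and thresholds, for illustration only.
HEALTH_URL = "https://core.example-voip-platform.net/healthz"
FAILURE_THRESHOLD = 3        # consecutive failed checks before failing over
CHECK_INTERVAL_SECONDS = 10

def core_platform_healthy() -> bool:
    """Return True if the core platform answers its health check."""
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def monitor(trigger_failover) -> None:
    """Poll the core platform and invoke the failover routine after repeated failures."""
    failures = 0
    while True:
        failures = 0 if core_platform_healthy() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            trigger_failover()  # e.g. a DNS update like the one sketched above
            return
        time.sleep(CHECK_INTERVAL_SECONDS)
```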
This incident highlighted Gradwell’s reliance on our networking supplier, and the impact that changes made to their network can have on our services. We are working closely with our supplier to improve the service they provide and to strengthen the monitoring and maintenance of the network.
Once again, we apologise for the disruption caused by this incident. We are committed to providing a best-in-class service for our customers and can assure you we are working hard to deliver this.