Gradwell were alerted to an incident in which a number of customers were experiencing total loss of service on broadband circuits. Gradwell engineers were fully engaged on the incident, which was given priority one status.
At 11:30, the upstream supplier were made aware of an incident affecting services hosted within their Telehouse PoP. Due to the range of services affected, a Major Incident was declared. Initial investigations identified that a routing process had failed on a core router in the affected PoP. The cause of this routing failure was linked to a routine change deployed moments beforehand. The change caused the BGP routing process to stop on the affected device, which caused a drop in connectivity across multiple voice and data services. The upstream supplier’s core network and infrastructure engineers were able to restore service by reverting the change and restarting the process. This was completed at 11:50, with the majority of affected services restored by 11:58.
The upstream supplier identified an error in the delivery of the routine change detailed above, which resulted in the BGP process stopping. This caused a loss of routing for multiple voice and data services which terminate at this location.
The root cause of this incident has been identified as human error in the deployment of a routine change. Although the correct change control process was followed, the post-incident review identified an area requiring improvement. As a result, an update is being made to the upstream supplier’s change control process, with specific focus on the way in which such changes are deployed into production.