Loss of Service on Broadband Connections
Incident Report for Gradwell Communications Ltd
Postmortem

Description of outage and impact:

Gradwell were alerted that a number of customers were experiencing a total loss of service on broadband circuits. Gradwell engineers were fully engaged on the incident, which was given priority one status.

Cause & Resolution

At 11:30, the upstream supplier were made aware of an incident affecting services hosted within their Telehouse PoP. Due to the range of services affected, a Major Incident was declared. Initial investigations identified that a routing process had failed on a core router in the affected PoP. The failure was linked to a routine change deployed moments beforehand. The change caused the BGP routing process to stop on the affected device, which caused a drop in connectivity across multiple voice and data services. The upstream supplier’s core network and infrastructure engineers restored service by reverting the change and restarting the process. This was completed at 11:50, with the majority of affected services restored by 11:58.

Root Cause

The upstream supplier identified an error in the delivery of the routine change detailed above, which resulted in the BGP process stopping. This caused a loss of routing for multiple voice and data services that terminate at this location.

Prevention of recurrence

The root cause of this incident has been identified as human error in the deployment of a routine change. Although the correct change control process was followed, the post-incident review identified an area requiring improvement. As a result, an update is being made to the upstream supplier’s change control process, with specific focus on the way in which such changes are deployed into production.

Posted Jun 26, 2023 - 13:04 BST

Resolved
Hello,

Having monitored the incident, we are confident that services have been restored successfully.

Please accept our sincerest apologies for the inconvenience caused. We will provide an RFO within 10 working days.
Posted Jun 07, 2023 - 13:42 BST
Monitoring
Hello,

We can see that circuits are now starting to restore their connections. We are working with our upstream supplier to establish the root cause of the outage. To ensure that your services are fully functional, we request that you reboot your router. This will rule out authentication issues on the device.

Please accept our sincerest apologies for the inconvenience caused. We will provide an RFO within 10 working days.
Posted Jun 07, 2023 - 12:16 BST
Investigating
Hello,

We are aware of an issue impacting the broadband connections of a subset of our customers whose services are routed through one of our upstream carriers. The carrier's network engineering and network systems teams are continuing to diagnose the issue.

Please accept our sincerest apologies for the inconvenience caused. We will provide a further update by 12:30pm.
Posted Jun 07, 2023 - 11:48 BST
This incident affected: Connectivity (Fibre line, EFM, ADSL Broadband, FTTC).