On Friday 30th May, an alert was triggered indicating a problem on our core platform. Inbound and outbound call setup immediately began failing across the platform, resulting in a Major Service Outage. During the incident, customers would have experienced delays or timeouts when attempting to place or receive calls. Importantly, calls already in progress were unaffected. The incident was immediately escalated to priority one status, and Gradwell engineers were fully engaged in the investigation and resolution.
The initial cause of the outage was traced to a deadlock in the call handling process. The deadlock left part of the system unresponsive, which caused significant delays and timeouts in call setup. Recovery was automatic: our tooling identified the failure and terminated the connections, which restored call traffic in both directions without manual intervention.
Once core functionality had been restored, we observed that a small subset of customers was still experiencing issues with inbound calls. Further investigation showed that an upstream supplier had begun routing traffic through new IP addresses without notifying Gradwell. Once the new IP addresses had been identified and whitelisted, normal service was restored for the affected customers.
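For readers who maintain their own source IP whitelists, the check described above can be sketched as follows. This is a minimal illustration and not a description of Gradwell's systems: the names `ALLOWED_NETWORKS`, `is_allowed` and `unknown_sources` are hypothetical, and the addresses are reserved documentation ranges rather than real supplier addresses.

```python
import ipaddress

# Hypothetical whitelist. These are documentation-only ranges (RFC 5737),
# used here as placeholders for a supplier's announced address blocks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/28"),
]


def is_allowed(source_ip, networks=ALLOWED_NETWORKS):
    """Return True if source_ip falls inside any whitelisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in networks)


def unknown_sources(source_ips, networks=ALLOWED_NETWORKS):
    """Return the distinct source addresses that are not whitelisted, sorted as strings."""
    return sorted({ip for ip in source_ips if not is_allowed(ip, networks)})
```

Running something like `unknown_sources` over the source addresses of rejected inbound traffic is one way an unannounced address change of this kind could be spotted; how the addresses are collected is left out of the sketch.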
The root cause of the incident was a deadlock in the call handling process that led to system-wide timeouts. The impact was extended for some customers by an unannounced change at an upstream supplier, which began sending traffic through new IP addresses that had not been communicated to Gradwell or pre-approved.
Our alerting systems operated as expected and will remain in place. To further reduce the risk of recurrence, development work is already underway to redesign the post-call processing components of the call handling system, with the aim of preventing future deadlocks. We are also taking steps to improve coordination with upstream suppliers, so that any changes in routing, such as new IP addresses, are communicated in advance and properly integrated into our systems.
Please accept Gradwell’s sincere apologies for the service disruption and the impact it has had on your business.