Gradwell Internet for business people

Gradwell Service News

Resolved VoIP Problems

RESOLVED: Power outage in Telehouse North

PROBLEM DESCRIPTION

Telehouse Power Failure

Several rooms in Telehouse North lost power at approximately 14:10 today.
This means that:

Some networks, or portions of networks with particular dependency on
Telehouse North will be off the air.
Your connection to some of your ISPs may terminate on equipment in
Telehouse North which has failed.
All networks that are unaffected will be handling much higher volumes
of traffic, leading to higher latency and packet loss even on connections
that are still available.

Affected Services:

All services

Customer Impact:

Dependent on each user's ISP peering/routing

Estimated Resolution Time:

We will update again by 16:30

***Update 15:13***

All systems should now be clearing and getting back to normal in Telehouse. Some routing issues will still remain for a little while longer and links will be seeing a much higher throughput as traffic has been re-routed.

***Closing Update 15:51***

Most peering links are now stable and disruption is minimal, but please be aware that there may still be slight congestion. Our systems remained online and are not experiencing any problems caused by the earlier power outage.

RESOLVED: Loss of storage SAN causing network wide problems

PROBLEM DESCRIPTION

At approximately 16:35 BST, Gradwell’s system administrators were alerted to a problem with one of our storage SANs.

Affected Services:

All services

Customer Impact:

All services may be affected, including VoIP and Hosting. Phones may lose registration and be unable to make or receive calls. Some websites will be offline.

Estimated Resolution Time:

We are investigating this now and will update by 17:45 with more information.

***Update 17:00***

All lines into our support office are also offline at present.


***Update 17:20***

Services are now starting to return to normal as we restart the affected services and servers. Our support line is now back online.

***Update 18:15***

Most servers have now been restarted and most services are back online. We are continuing to resolve any remaining issues.

***Update 18:31***

All services should now be available.

RESOLVED: Hosting and VoIP platform issues

PROBLEM DESCRIPTION

At approximately 06:50 BST, Gradwell’s system administrators were alerted to system issues affecting a number of Gradwell services.

Affected Services:

Web services

Control panels

VoIP services

Customer Impact:

Affected customers will see errors when accessing hosted websites or control panels, and when attempting to make outbound calls.

Estimated Resolution Time:

Our system admin team are working on this now and will update again at or before 09:30.

***Update*** 9:35

VoIP services and control panels should now be working as expected. We are still working on the web clusters and expect to have these back online shortly. We will update again at or before 10:30.

***Update*** 10:36

The web cluster is now back online and all services should be running as expected.

There may be some slowdown on control panels as systems are busy processing any backlogs.

We are continuing to monitor and will update again at 11:30.

***Update*** 11:29

All systems are running correctly and remain stable. We will continue to monitor closely for the next few hours and update or close this status at 13:30.

***Update*** 13:20

All systems are now running correctly and we are now closing this status update.

The problem has been traced to one of our DNS cache servers. This cache server, 193.111.200.191, stopped responding, which in turn caused our master MySQL server to effectively lock up and fail to respond to queries correctly. The majority of our infrastructure relies on this database, hence parts of it became unstable.

We apologise for any problems this has caused you.

RESOLVED: Customer MySQL DB server and PBX updates

PROBLEM DESCRIPTION

voip-manager and mysqldb are currently experiencing an outage.

At approximately 18:00 BST, our systems team became aware of an issue affecting both our provisioning and one of our customer-facing database servers.  This fault appears to have started at around 17:00, and was a progressive problem, as the relevant servers became less responsive over time.  We are presently investigating the fault, and working with our Telehouse operations team to rectify the problem as soon as possible.

Affected Services:

PBX provisioning

Customer databases on mysqldb.gradwell.com

Customer Impact:

Whilst all VoIP services are up, changes cannot be processed, so any updates performed via the control panels will not become live.

All customer databases hosted on the mysqldb.gradwell.com service are affected. This will not affect other servers, such as mysql5db-1 or our other two MySQL 4-based servers.


Estimated Resolution Time:

We are presently waiting for on-site engineers to restart the affected services and will provide an update within an hour.

***Update 19:38***

Our on-site ops team have been unsuccessful in their attempts to restart the machine, so this would appear to be a hardware failure. Our sysadmin and VoIP dev teams are currently building a replacement for this machine. At present we are unable to offer a concrete ETA, but we will update again as soon as we have more information. Our next scheduled update will be in approximately one hour.

***Update 21:26***

VoIP: Our team have now restored our VoIP provisioning systems.  If any further issues are seen with VoIP-related updates, please contact our support team.

Hosting: Due to this hardware failure, we are restoring databases from a recent backup (taken at approximately 1AM this morning). Changes made to databases on mysqldb.gradwell.com today will have been lost. Other database servers are not affected, and we apologise for any inconvenience this may cause you. This work should complete shortly.

*** Update ***

This issue was resolved on the day the alert was announced on www.gradwellstatus.com. It has now been closed as a historical problem. Please note that the replacement MySQL server runs MySQL 5.

Virtualisation SAN outage

We have been alerted to a fault on one of our virtualisation SANs which is affecting a large part of our infrastructure. An engineer is on his way to repair the fault and we hope to update this notice shortly with further information.

UPDATE 1727: Phones using the Gradwell SIP platform should now be working again. If you are continuing to have problems, please try rebooting your device.

UPDATE 1957: Most services should have come fully back to normal within the last few moments. Secondary DNS customers may find that, due to some zone file corruption, their zones are not yet being served from our authoritative servers. The zone files are rebuilding at the moment and we will update when this is completed.

UPDATE 2024: Some residual issues with outbound call setup times have now been resolved.

UPDATE 0035: All services appear to have been running normally for some time. If you are experiencing any continuing issues please raise a ticket with support.

RESOLVED: Call Set Up Issues

We are currently seeing call setup problems with the VoIP platform. This will be seen as calls taking a long time to set up, or failing to set up completely. Our system engineers are working at the highest priority to diagnose and resolve the issues being seen.

We apologise for any inconvenience caused and will post further updates as soon as they are available.

***UPDATE*** 12:00

The outbound call issues have now been resolved. We will post a report soon on what caused the problems.

RESOLVED: Call setup delays

Some users are seeing problems with longer than usual call setup times. Our VoIP team are working on this now and we will update here as soon as an update is available.

We apologise for any problems caused by this.

***Update 22:00***

We have had no further reports of setup delays as of 20:00. Our out-of-hours team will continue to monitor, and we will close this status update when we have gathered enough data to be satisfied that there are no further delays.

RESOLVED: DNS issues

Some users may have experienced an increase in call setup times and other DNS-related problems over the last few minutes, as one of our DNS servers had failed over to backup. Full connectivity has been restored by our sysadmin team and no further problems should be seen.

We apologise for any disruption caused by this.

NEWS: Recent DNS issues

In the last two weeks one of our DNS cache servers has briefly stopped responding a couple of times, causing customers' phone calls to fail to connect. We are still working to fully understand the cause of these problems, but have concluded some initial investigations.

We have DNS servers 191 and 91, which keep a heartbeat between each other so they know when to fail over. The problem has been the heartbeat failing and the network “ARP cache” not directing traffic to the correct (working) host, so that requests continue to go to the wrong (broken) server.
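As an illustration only, the heartbeat side of such a pair can be sketched as below. This is a minimal sketch, not a description of how our servers are actually configured: the class name, the timeout value and the `on_takeover` hook are all hypothetical. In a real deployment the hook would be the place where the standby claims the service IP and makes sure neighbouring hosts update their ARP caches, which is the step that went wrong here.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Standby-side view of a peer: take over once heartbeats stop arriving.

    All names and values here are hypothetical and for illustration only.
    """

    def __init__(
        self,
        timeout_s: float,
        on_takeover: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s      # silence tolerated before failing over
        self.on_takeover = on_takeover  # hypothetical hook: claim the service IP
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = clock()
        self.active = False             # True once this node has taken over

    def beat(self) -> None:
        """Record that a heartbeat from the peer has just arrived."""
        self.last_seen = self.clock()

    def check(self) -> bool:
        """Trigger the takeover (once) if the peer has been silent too long."""
        silent_for = self.clock() - self.last_seen
        if not self.active and silent_for > self.timeout_s:
            self.active = True
            self.on_takeover()
        return self.active
```

The takeover fires only once, however often `check()` is called afterwards, so the hook does not need to be safe to repeat.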

The sysadmin team have been developing strategies to prevent this happening again. These include using improved configuration management to deploy different DNS configurations, and local DNS cache servers, across our clusters of servers.

This, along with some re-engineering of the way these IPs fail over, should add a great deal more stability and resilience.

However, the VoIP/SIP servers do of course have both DNS server pairs configured in a high-availability mode, so we would expect them to continue working and if the first fails, they should simply query the second one.

This has obviously not happened, which may be due to a bug in the SIP proxy software we use. The VoIP dev team are currently trying to reproduce this in lab conditions so that we can better understand how to stop it happening in future.
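For reference, the fallback behaviour we would expect (query the first DNS server, and move on to the second if it does not answer) can be sketched roughly as follows. This is an illustrative sketch only, not the SIP proxy's actual code: `resolve_with_fallback`, `AllResolversFailed` and the `query` callable are hypothetical names, and `query` stands in for whatever lookup routine a real client would use.

```python
from typing import Callable, Sequence


class AllResolversFailed(Exception):
    """Raised when none of the configured resolvers returned an answer."""


def resolve_with_fallback(
    name: str,
    resolvers: Sequence[str],
    query: Callable[[str, str], str],
) -> str:
    """Ask each resolver in order and return the first answer received.

    `query(resolver, name)` is a hypothetical lookup function. It is assumed
    to return an address, or to raise OSError (which includes TimeoutError)
    when the resolver cannot be reached or does not reply in time.
    """
    errors = []
    for resolver in resolvers:
        try:
            return query(resolver, name)
        except OSError as exc:
            # Remember why this resolver failed, then try the next one.
            errors.append(f"{resolver}: {exc}")
    raise AllResolversFailed("; ".join(errors))
```

A caller would pass the two resolver addresses in order of preference. The important detail is that a timeout on the first resolver is treated as a reason to try the second, rather than as a failed lookup.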

RESOLVED: Inbound calls to our support team

We are seeing some problems with customers calling in to our support numbers. Our server admin team are working on this now and will restore normal service ASAP. In the meantime please raise any support incidents via email to support@gradwell.com

***Update 14:24***

This is now fully resolved. We have been monitoring for the last 20 minutes and calls are being routed correctly. We apologise for any problems caused by this.