Gradwell Internet for business people

Gradwell Service News

Hosting

RESOLVED: Power outage in Telehouse North

PROBLEM DESCRIPTION

Telehouse Power Failure

Several rooms in Telehouse North lost power at approximately 14:10 today.
This means that:

Some networks, or portions of networks with particular dependency on
Telehouse North will be off the air.
Your connection to some of your ISPs may terminate on equipment in
Telehouse North which has failed.
All networks that are unaffected will be handling much higher volumes
of traffic, leading to higher latency and packet loss even on connections
that are still available.

Affected Services:

All services

Customer Impact:

Dependent on the user's ISP peering/routing

Estimated Resolution Time:

We will update again by 16:30

***Update 15:13***

All systems should now be clearing and getting back to normal in Telehouse. Some routing issues will still remain for a little while longer and links will be seeing a much higher throughput as traffic has been re-routed.

***Closing Update 15:51***

Most peering links are now stable and disruption is minimal, but please be aware that there may still be slight congestion. Our systems remained online and are not experiencing any problems caused by the earlier power outage.

RESOLVED: Loss of storage SAN causing network wide problems

PROBLEM DESCRIPTION

At approximately 16:35 BST, Gradwell’s system administrators were alerted to a problem with one of our storage SANs.

Affected Services:

All services

Customer Impact:

All services may be affected, including VoIP and Hosting. Phones may lose registration and be unable to make or receive calls. Some websites will be offline.

Estimated Resolution Time:

We are investigating this now and will update by 17:45 with more information.

***Update 17:00***

All lines into our support office are also offline at present.


***Update 17:20***

Services are now starting to return to normal as we restart the affected services and servers. Our support line is now back online.

***Update 18:15***

Most servers have now been restarted and most services are back online. We are continuing to resolve any remaining issues.

***Update 18:31***

All services should now be available.

RESOLVED: Hosting and VoIP platform issues

PROBLEM DESCRIPTION

At approximately 06:50 BST, Gradwell’s system administrators were alerted to system issues affecting a number of Gradwell services.

Affected Services:

Web services

Control panels

VoIP services

Customer Impact:

Affected customers will see errors when accessing hosted websites or control panels and when attempting to make outbound calls.

Estimated Resolution Time:

Our system admin team are working on this now and will update again at or before 09:30.

***Update*** 9:35

VoIP services and control panels should now be working as expected. We are still working on the web clusters and expect to have these back online shortly. We will update again at or before 10:30.

***Update*** 10:36

The web cluster is now back online and all services should be running as expected.

There may be some slowdown on control panels as systems are busy processing any backlogs.

We are continuing to monitor and will update again at 11:30.

***Update*** 11:29

All systems are running correctly and remain stable. We will continue to monitor closely for the next few hours and will update or close this status at 13:30.

***Update*** 13:20

All systems are now running correctly and we are now closing this status update.

The problem has been traced to one of our DNS cache servers. This cache server, 193.111.200.191, stopped responding, which in turn caused our master MySQL server to effectively lock up and fail to respond to queries correctly. The majority of our infrastructure relies on this database, hence parts of it became unstable.
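For readers who run their own services, one general way to stop an unresponsive resolver from stalling a dependent system is to bound every lookup with a timeout. The sketch below is illustrative only and does not describe Gradwell's systems; the function name `resolve_with_timeout` and the two-second default are assumptions made for the example.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def resolve_with_timeout(hostname, timeout_s=2.0, resolver=socket.gethostbyname):
    """Return the resolved address, or None if the lookup fails or takes too long.

    `resolver` is injectable so that the behaviour can be exercised without
    a real DNS server; by default the system resolver is used.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolver, hostname)
    try:
        return future.result(timeout=timeout_s)
    except (FutureTimeout, OSError):
        # Covers both a lookup that exceeded the timeout and a resolution error.
        return None
    finally:
        # Do not block the caller on a hung lookup; the worker thread is
        # left to finish in the background.
        pool.shutdown(wait=False)
```

A caller can then treat `None` as "name unavailable" and fall back or fail fast, rather than waiting indefinitely on the resolver.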

We apologise for any problems this has caused you.

RESOLVED: Customer MySQL DB server and PBX updates

PROBLEM DESCRIPTION

voip-manager and mysqldb are currently experiencing an outage.

At approximately 18:00 BST, our systems team became aware of an issue affecting both our provisioning and one of our customer-facing database servers.  This fault appears to have started at around 17:00, and was a progressive problem, as the relevant servers became less responsive over time.  We are presently investigating the fault, and working with our Telehouse operations team to rectify the problem as soon as possible.

Affected Services:

PBX provisioning

Customer databases on mysqldb.gradwell.com

Customer Impact:

Whilst all VoIP services are up, changes cannot be processed, so any updates performed via the control panels will not become live.

All customer databases hosted on the mysqldb.gradwell.com service are affected. This will not affect other servers, such as mysql5db-1 or our other two MySQL 4-based servers.


Estimated Resolution Time:

We are presently waiting for on-site engineers to restart the affected services and will provide an update within an hour.

***Update 19:38***

Our on-site ops team have been unable to restart the machine, so this would appear to be a hardware failure. Our sysadmin and VoIP dev teams are currently building a replacement for this machine. At present we are unable to offer a concrete ETA, but we will post any news as soon as possible. Our next scheduled update will be in approximately one hour.

***Update 21:26***

VoIP: Our team have now restored our VoIP provisioning systems.  If any further issues are seen with VoIP-related updates, please contact our support team.

Hosting: Due to this hardware failure, we are restoring database access from a recent backup (taken at approximately 1AM this morning).  Changes made to databases on mysqldb.gradwell.com today will have been lost.  Other database servers are not affected, and we apologise for any inconvenience this may cause you.  This work should complete shortly.

*** Update ***

This issue was resolved on the day the alert was announced on www.gradwellstatus.com.  It has now been closed as a historical problem.  Please note that the replacement MySQL server runs MySQL 5.

Virtualisation SAN outage

We have been alerted to a fault on one of our virtualisation SANs which is affecting a large part of our infrastructure. An engineer is on his way to repair the fault and we hope to update this notice shortly with further information.

UPDATE 1727: Phones using the Gradwell SIP platform should now be working again. If you are continuing to have problems, please try rebooting your device.

UPDATE 1957: Most services should have come fully back to normal within the last few moments. Secondary DNS customers may find that, due to some zone file corruption, their zones are not yet being served from our authoritative servers. The zone files are rebuilding at the moment and we will update when this is completed.

UPDATE 2024: Some residual issues with outbound call setup times have now been resolved.

UPDATE 0035: All services appear to have been running normally for some time. If you are experiencing any continuing issues, please raise a ticket with support.

RESOLVED: empty home folders

It has been brought to our attention that, following the storage issues we experienced yesterday, some customers have been left with empty home folders. This was due to a backend script creating duplicate home folders. These duplicates have now been removed and all home folders should show the correct data. If you are still having issues, please contact support.

RESOLVED: Problems with fileserver flathead

We are currently experiencing ongoing issues with one of our fileservers, flathead. The server is currently completely offline; however, we are hoping to bring it back online shortly.

When this happens, some customers may find that their files are out of date. We will initially be starting the server from our latest backup. During the day we will be attempting to restore more up-to-date data, and we will update further on this as more details become available.

Please accept our apologies for any inconvenience this causes; unfortunately the situation is unavoidable. Engineers have been working through the night trying to restart the old partition, but this is not possible under full traffic.

***Update 15:04***

‘Flathead’ is now back up and running and has been for a while. Websites should be displaying correctly and no errors should be shown. Our sysadmin team are still onsite monitoring this server closely and we will update again shortly.

Again our apologies for any inconvenience this has caused you.

*** Update 17:20 ***

We were aware of a number of intermittent web page access problems. We have spent the last hour checking individual sites and are confident that we have ironed out any remaining issues in our web cluster and that this problem is now resolved.

COMPLETED: Filestore maintenance - Friday 28th May 00:01 to 02:00

Update 06:00: The filesystem on flathead is offline for a consistency check.

Update 04:55: All services are currently online. We will continue to monitor the situation.

Update 03:30: We are seeing some unusual errors on the home file partition on sawtooth that are causing performance issues with the customer web clusters. The home partition on sawtooth is currently offline for a full consistency check.

Update 02:00: All services are back online.

Update 01:41: Work on flathead is not going to complete before 2am. We will return flathead to service within the next few minutes and schedule a further window to complete this work at a later date.

Update 00:51: Work on all servers except flathead is now complete. Flathead will be unavailable for some time longer while it completes essential filesystem maintenance.

We are performing some maintenance on all of our user mail and home filestore servers during the above window. We have increased the storage available to each filestore, and because of this each server needs to be rebooted to pick up its new disk quota.

Each filestore will be offline, one by one, for approximately 10 minutes each, apart from ‘flathead’, which is being moved to another host. This server may be offline for a little longer.

We apologise for any inconvenience caused by this work.
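As a rough illustration of how a one-by-one reboot fits into the window, the sketch below assigns consecutive 10-minute slots starting at 00:01. The helper name `rolling_schedule` and the server names are hypothetical; the full filestore list is not given in this notice, and flathead's longer migration is not modelled.

```python
from datetime import datetime, timedelta


def rolling_schedule(servers, start, slot_minutes=10):
    """Assign each server a consecutive, non-overlapping offline slot."""
    schedule = []
    for i, name in enumerate(servers):
        begin = start + timedelta(minutes=i * slot_minutes)
        schedule.append((name, begin, begin + timedelta(minutes=slot_minutes)))
    return schedule


# Hypothetical server names, used for illustration only.
window_start = datetime(2010, 5, 28, 0, 1)
for name, begin, end in rolling_schedule(
    ["filestore-a", "filestore-b", "filestore-c"], window_start
):
    print(f"{name}: {begin:%H:%M} to {end:%H:%M}")
```

Because the slots do not overlap, only one filestore is offline at any given time.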

***Update***

The majority of the work has completed.

Please see http://www.gradwellstatus.com/2010/05/28/ongoing-problems-with-fileserver-flathead/ for further updates.

RESOLVED: Clearwater NFS Server Outage

We are currently experiencing an issue with one of our NFS home file servers - Clearwater.

Our system administration team are currently working on resolving the problem and restoring service.

Customers hosted on this server will find that some services such as Web Hosting will be unavailable.

We apologise for any inconvenience caused and are working to restore service to affected customers as soon as possible. Support staff will update this post once further news is received.

Update (19:20): This server is now functioning correctly again.  Please accept our apologies for any inconvenience.

Update (21:45): Unfortunately this file server has not been stable and keeps crashing. We are therefore deploying a replacement server for it and will re-mount the hard disks storing user files to resume service. We expect this service to be back online before midnight tonight.

Update (22:50): All web hosting services are now back online. We also experienced a problem with our ‘flathead’ file server, which is also resolved.

Please accept our apologies for the inconvenience caused.

RESOLVED: Email and MySQL issues

We are experiencing some issues with email storage servers and customer MySQL servers after our scheduled maintenance which completed in the early hours of this morning.  Our system administration team are currently investigating the problems.

Update (14:00): The email and MySQL issues are now resolved.