Gradwell Internet for business people

Gradwell Service News

VoIP

RESOLVED: Power outage in Telehouse North

PROBLEM DESCRIPTION

Telehouse Power Failure

Several rooms in Telehouse North lost power at approximately 14:10 today.
This means that:

Some networks, or portions of networks with particular dependency on
Telehouse North will be off the air.
Your connection to some of your ISPs may terminate on equipment in
Telehouse North which has failed.
All networks that are unaffected will be handling much higher volumes
of traffic, leading to higher latency and packet loss even on connections
that are still available.

Affected Services:

All services

Customer Impact:

Dependent on the user's ISP peering/routing

Estimated Resolution Time:

We will update again by 16:30

***Update 15:13***

All systems in Telehouse should now be recovering and returning to normal. Some routing issues will remain for a little while longer, and links will see much higher throughput as traffic has been re-routed.

***Closing Update 15:51***

Most peering links are now stable and disruption is minimal, but please be aware that there may still be slight congestion. Our systems remained online and are not experiencing any problems caused by the earlier power outage.

RESOLVED: Loss of storage SAN causing network wide problems

PROBLEM DESCRIPTION

At approximately 16:35 BST, Gradwell’s system administrators were alerted to a problem with one of our storage SANs.

Affected Services:

All services

Customer Impact:

All services may be affected, including VoIP and hosting. Phones may lose registration and be unable to make or receive calls. Some websites will be offline.

Estimated Resolution Time:

We are investigating this now and will update by 17:45 with more information.

***Update 17:00***

All lines into our support office are also offline at present.


***Update 17:20***

Services are now starting to return to normal as we restart the affected services and servers. Our support line is now back online.

***Update 18:15***

Most servers have now been restarted and most services are back online. We are continuing to resolve any remaining issues.

***Update 18:31***

All services should now be available.

RESOLVED: Hosting and VoIP platform issues

PROBLEM DESCRIPTION

At approximately 06:50 BST, Gradwell’s system administrators were alerted to system issues affecting a number of Gradwell services.

Affected Services:

Web services

Control panels

VoIP services

Customer Impact:

Affected customers will see errors when accessing hosted websites or control panels, and when attempting to make outbound calls.

Estimated Resolution Time:

Our system admin team are working on this now and will update again at or before 09:30.

***Update 09:35***

VoIP services and control panels should now be working as expected. We are still working on the web clusters and expect to have these back online shortly. We will update again at or before 10:30.

***Update 10:36***

The web cluster is now back online and all services should be running as expected.

There may be some slowdown on control panels as systems are busy processing any backlogs.

We are continuing to monitor and will update again at 11:30.

***Update 11:29***

All systems are running correctly and remain stable. We will continue to monitor closely for the next few hours and will update or close this status at 13:30.

***Update 13:20***

All systems are now running correctly and we are now closing this status update.

The problem has been traced to one of our DNS cache servers. This cache server, 193.111.200.191, stopped responding, which in turn caused our master MySQL server to effectively lock up and fail to respond to queries correctly. The majority of our infrastructure relies on this database, hence parts of it became unstable.
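
As an illustration only, the failure mode described above (a service stalling because a DNS cache has stopped answering) can be guarded against by putting a time limit on each lookup. The Python sketch below is not a description of the fix applied to the platform; the function name `reverse_lookup`, the 0.5 second timeout and the pool size are assumptions chosen for the example.

```python
import concurrent.futures
import socket
import time

# Small shared worker pool (size is an assumption): lookups run here so that
# a hung resolver blocks a worker thread rather than the calling code.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def reverse_lookup(ip, timeout=0.5, resolver=socket.gethostbyaddr):
    """Return the hostname for ip, or None if the lookup fails or is too slow.

    resolver is injectable so the behaviour can be exercised without a
    network; by default it is the standard library's socket.gethostbyaddr.
    """
    future = _POOL.submit(resolver, ip)
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        return future.result(timeout=timeout)[0]
    except (concurrent.futures.TimeoutError, OSError):
        # Timed out, or the resolver reported an error (socket.herror and
        # socket.gaierror are both subclasses of OSError).
        return None
```

A caller that receives `None` can carry on with the bare IP address instead of waiting indefinitely. Note that the abandoned lookup still occupies a worker thread until the resolver eventually returns, so this limits the impact of a dead cache rather than removing it.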

We apologise for any problems this has caused you.

RESOLVED: Customer MySQL DB server and PBX updates

PROBLEM DESCRIPTION

voip-manager and mysqldb are currently experiencing an outage.

At approximately 18:00 BST, our systems team became aware of an issue affecting both our provisioning and one of our customer-facing database servers.  This fault appears to have started at around 17:00, and was a progressive problem, as the relevant servers became less responsive over time.  We are presently investigating the fault, and working with our Telehouse operations team to rectify the problem as soon as possible.

Affected Services:

PBX provisioning

Customer databases on mysqldb.gradwell.com

Customer Impact:

Whilst all VoIP services are up, changes cannot be processed, so any updates performed via the control panels will not become live.

All customer databases hosted on the mysqldb.gradwell.com service are affected. This will not affect other servers, such as mysql5db-1 or our other two MySQL 4-based servers.


Estimated Resolution Time:

We are presently waiting for on-site engineers to restart the affected services and will provide an update within an hour.

***Update 19:38***

Our on-site ops team have been unsuccessful in their attempts to restart the machine, so this would appear to be a hardware failure. Our sysadmin and VoIP dev teams are currently building a replacement for this machine. At present we are unable to offer a concrete ETA, but we will update again as soon as we have more information. Our next scheduled update will be in approximately one hour.

***Update 21:26***

VoIP: Our team have now restored our VoIP provisioning systems.  If any further issues are seen with VoIP-related updates, please contact our support team.

Hosting: Due to this hardware failure, we are restoring database access from a recent backup (taken at approximately 01:00 this morning).  Changes made to databases on mysqldb.gradwell.com today will have been lost.  Other database servers are not affected, and we apologise for any inconvenience this may cause you.  This work should complete shortly.

*** Update ***

This issue was resolved on the day the alert was announced on www.gradwellstatus.com.  It has now been closed as a historical problem.  Please note that the replacement MySQL server runs MySQL 5.

RESOLVED: Call drops on hosted VoIP platform - SIP only

Affected Services:

VoIP - SIP: intermittent call drops on the hosted VoIP platform (sip.gradwell.com/sip.trunk.gradwell.com)

Customer Impact:

Customers may be seeing some intermittent call drops. IAX trunk users should be unaffected.

Action:

We are implementing a code fix to the live platform and are rolling out these changes now.

***Update 12:20***

We have rolled out the code fix to all live servers and this seems to have worked as expected. We will continue to monitor for the next few hours to make sure there are no further problems.

***Update 15:10***

We have observed no further issues with the SIP hosted VoIP platform and are therefore happy to close this status update. We apologise for any problems these earlier issues may have caused you.

COMPLETED: Hosted PBX/single line registration maintenance

Starts: 2010-06-01 23:00 Ends: 2010-06-01 23:59

This evening we will be performing maintenance on the VoIP registration platform to ensure that our state databases are consistent.  These checks should not interrupt the service for customers, although the network should be considered ‘at risk’ during this time.  These consistency checks do not affect SIP or IAX trunk customers.

This maintenance was completed successfully.

Virtualisation SAN outage

We have been alerted to a fault on one of our virtualisation SANs which is affecting a large part of our infrastructure. An engineer is on the way to repair the fault and we hope to update this notice shortly with further information.

UPDATE 1727: Phones using the Gradwell SIP platform should now be working again. If you are continuing to have problems, please try rebooting your device.

UPDATE 1957: Most services should have come fully back to normal within the last few moments. Secondary DNS customers may find that, due to some zone file corruption, their zones are not yet being served from our authoritative servers. The zone files are rebuilding at the moment and we will update when this is completed.

UPDATE 2024: Some residual issues with outbound call setup times have now been resolved.

UPDATE 0035: All services appear to have been running normally for some time. If you are experiencing any continuing issues please raise a ticket with support.

RESOLVED: Call Set Up Issues

We are currently seeing call set up problems with the VoIP platform. This will be seen as calls taking a long time to set up, or failing to set up completely. Our system engineers are working at the highest priority to diagnose and resolve these issues.

We apologise for any inconvenience caused and will post further updates as soon as they are available.

***Update 12:00***

The issue with outbound calls has now been resolved. We will post a report soon on what caused the problems.

COMPLETED: Network maintenance and upgrades

Further to our previous announcement, our rescheduled network maintenance will commence on Friday 30th April. The maintenance window will be 23:00 Friday 30th April to 02:00 Saturday 1st May.

During this period we will be performing a few tasks including installing new hardware, performing work on our DNS caches, reconfiguring some access switches and making some cabling changes. If time permits we will also take this opportunity to make some configuration changes to the PHP cluster to enhance performance.

Due to the nature of this work, segments of the network will be inaccessible for short blocks of time, and this may result in a reduction in service during the given window. VoIP, DSL and hosting will all be affected for short periods.

We will post here again when the work is completed and has been fully tested.

RESOLVED: Call setup delays

Some users are seeing problems with longer than usual call setup times. Our VoIP team are working on this now and we will update here as soon as an update is available.

We apologise for any problems caused by this.

***Update 22:00***

We have had no further reports of setup delays since 20:00, but we are continuing to monitor. Our out-of-hours team will keep monitoring, and we will close this status update when we have gathered enough data to be satisfied that there are no further delays.