Gradwell Internet for business people

Gradwell Service News

DNS

RESOLVED: Power outage in Telehouse North

PROBLEM DESCRIPTION

Telehouse Power Failure

Several rooms in Telehouse North lost power at approximately 14:10 today.
This means that:

Some networks, or portions of networks with particular dependency on
Telehouse North will be off the air.
Your connection to some of your ISPs may terminate on equipment in
Telehouse North which has failed.
All networks that are unaffected will be handling much higher volumes
of traffic, leading to higher latency and packet loss even on connections
that are still available.

Affected Services:

All services

Customer Impact:

Dependent on the user's ISP peering/routing

Estimated Resolution Time:

We will update again by 16:30

***Update 15:13***

All systems should now be clearing and getting back to normal in Telehouse. Some routing issues will still remain for a little while longer and links will be seeing a much higher throughput as traffic has been re-routed.

***Closing Update 15:51***

Most peering links are now stable and disruption is minimal, but please be aware that there may still be slight congestion. Our systems remained online and are not experiencing any problems caused by the earlier power outage.

RESOLVED: Loss of storage SAN causing network wide problems

PROBLEM DESCRIPTION

At approximately 16:35 BST, Gradwell’s system administrators were alerted to a problem with one of our storage SANs.

Affected Services:

All services

Customer Impact:

All services may be affected, including VoIP and Hosting. Phones may lose registration and be unable to make or receive calls. Some websites will be offline.

Estimated Resolution Time:

We are investigating this now and will update by 17:45 with more information

***Update 17:00***

All lines into our support office are also offline at present


***Update 17:20***

Services are now starting to return to normal as we restart the affected services and servers. Our support line is now back online.

***Update 18:15***

Most servers have now been restarted and most services are back online. We are continuing to resolve any remaining issues.

***Update 18:31***

All services should now be available.

Virtualisation SAN outage

We have been alerted to a fault on one of our virtualisation SANs which is affecting a large amount of our infrastructure. An engineer is on his way to repair the fault and we hope to update this notice shortly with further information.

UPDATE 1727: Phones using the Gradwell SIP platform should now be working again. If you are continuing to have problems please try rebooting your device.

UPDATE 1957: Most services should have come fully back to normal within the last few moments. Secondary DNS customers may find that, due to some zone file corruption, their zones are not yet being served from our authoritative servers. The zone files are rebuilding at the moment and we will update when this is completed.

UPDATE 2024: Some residual issues with outbound call setup times have now been resolved.

UPDATE 0035: All services appear to have been running normally for some time. If you are experiencing any continuing issues please raise a ticket with support.

COMPLETED: Network maintenance and upgrades

Further to our previous announcement, our rescheduled network maintenance will commence on Friday 30th April. The maintenance window will be 23:00 Friday 30th April to 02:00 Saturday 1st May.

During this period we will be performing a few tasks including installing new hardware, performing work on our DNS caches, reconfiguring some access switches and making some cabling changes. If time permits we will also take this opportunity to make some configuration changes to the PHP cluster to enhance performance.

Due to the nature of this work, segments of the network will be inaccessible for short blocks of time and as such this may result in reduction of service during the given window. VoIP, DSL and hosting will all be affected for short periods.

We will post here again when the work is completed and has been fully tested.

RESOLVED: DNS issues

Some users may have experienced an increase in call setup times and other DNS related problems for the last few minutes as one of our DNS servers had failed over to backup. Full connectivity has been restored by our sysadmin team and no further problems should be seen.

We apologise for any disruption caused by this.

NEWS: Recent DNS issues

In the last two weeks one of our DNS cache servers has briefly stopped responding a couple of times, causing customers' phone calls to fail to connect. We are still working to fully understand the cause of these problems, but have concluded some initial investigations.

We have DNS servers 191 and 91. These keep a heartbeat between each other so they know when to fail over. The problem has been with the heartbeat failing and the network “arp cache” not sending traffic to the correct (working) host, so that requests continue to go to the wrong (broken) server.
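The failure mode described above can be illustrated with a toy model. This is a minimal sketch and not a description of our actual configuration: the `Host` and `Segment` classes, the `fail_over` helper and the 192.0.2.53 address are all hypothetical, and the `announce` step simply stands in for whatever mechanism tells the network that an IP has moved (gratuitous ARP is one common example, assumed here for illustration only).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Host:
    name: str
    healthy: bool = True


@dataclass
class Segment:
    """Toy network segment with one ARP-like table: service IP -> host name."""
    hosts: Dict[str, Host]
    arp_cache: Dict[str, str] = field(default_factory=dict)

    def announce(self, ip: str, host_name: str) -> None:
        # Stands in for the owner of the IP advertising itself to the network.
        self.arp_cache[ip] = host_name

    def send_query(self, ip: str) -> Optional[str]:
        # Traffic goes wherever the cache points, whether that host works or not.
        owner = self.hosts[self.arp_cache[ip]]
        return owner.name if owner.healthy else None


def fail_over(segment: Segment, ip: str, standby: str, announce: bool) -> None:
    """The standby takes over the IP; the cache only changes if it is announced."""
    if announce:
        segment.announce(ip, standby)


SERVICE_IP = "192.0.2.53"  # documentation-range address, not a real resolver
seg = Segment(hosts={"primary": Host("primary"), "standby": Host("standby")})
seg.announce(SERVICE_IP, "primary")
seg.hosts["primary"].healthy = False  # the primary stops answering

# Failover without updating the cache: queries still reach the broken host.
fail_over(seg, SERVICE_IP, "standby", announce=False)
stale_result = seg.send_query(SERVICE_IP)

# Failover that also updates the cache: queries reach the working host.
fail_over(seg, SERVICE_IP, "standby", announce=True)
fresh_result = seg.send_query(SERVICE_IP)
```

In this model `stale_result` is `None` (no answer) and `fresh_result` is `"standby"`, which mirrors the symptom of requests continuing to go to the broken server until the cache is corrected.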

The sysadmin team have been developing strategies to prevent this happening again. These include using improved configuration management to deploy different DNS configurations, and local DNS cache servers, across our clusters of servers.

This, along with some re-engineering of the way these IPs fail over, should add a great deal more stability and resilience.

However, the VoIP/SIP servers do of course have both DNS server pairs configured in a high-availability mode, so we would expect them to continue working and if the first fails, they should simply query the second one.
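The expected behaviour, where a client queries the second resolver when the first fails, can be sketched as follows. This is an illustration only and not the logic of the SIP proxy software: `resolve_with_fallback` and `fake_query` are hypothetical names, the `query` callable is injected so the example runs without network access, and the hostname and answer are made-up documentation values.

```python
from typing import Callable, Sequence, Tuple


def resolve_with_fallback(
    name: str,
    resolvers: Sequence[str],
    query: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each resolver in order and return (resolver_used, answer)."""
    errors = []
    for resolver in resolvers:
        try:
            return resolver, query(resolver, name)
        except OSError as exc:  # TimeoutError is a subclass of OSError
            errors.append(f"{resolver}: {exc}")
    raise RuntimeError("all resolvers failed: " + "; ".join(errors))


def fake_query(resolver: str, name: str) -> str:
    """Hypothetical stand-in for a real DNS lookup: the first resolver is down."""
    if resolver == "193.111.200.91":
        raise TimeoutError("no response")
    return "192.0.2.10"  # documentation-range address


used, answer = resolve_with_fallback(
    "sip.example.com",
    ["193.111.200.91", "193.111.200.191"],
    fake_query,
)
```

Here the first resolver times out, so `used` ends up as `193.111.200.191`. A real client would also need a sensibly short per-resolver timeout, otherwise call setup is delayed while it waits on the failed server.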

This has obviously not happened, which may be due to a bug in the SIP proxy software we use. The VoIP dev team are currently trying to reproduce this in lab conditions so that we can better understand how to stop it happening in future.

RESOLVED: DNS cache issues

We have experienced a problem with one of our DNS caches. This may have caused some problems with call setup times, phone registration and other services. Our server admin team have cleared the fault and are continuing to monitor closely while the system comes back up to full working order.

We apologise for any inconvenience this has caused.

RESOLVED: Caching DNS server

One of our customer-facing caching DNS resolvers, with address 193.111.200.91, was not responding to DNS queries from around 04:00 until just after 07:00 today. Customers are advised to configure computers and phones with at least two resolver IP addresses:
193.111.200.91 and 193.111.200.191
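On a Unix-like computer, one way to confirm that both addresses are configured is to check the resolver configuration file. The sketch below assumes a `resolv.conf`-style format (one `nameserver` entry per line); phones and other operating systems store these settings differently, and the function names and sample text are hypothetical.

```python
import ipaddress
from typing import List, Sequence


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            ipaddress.ip_address(parts[1])  # raises ValueError if malformed
            servers.append(parts[1])
    return servers


def has_both_resolvers(
    resolv_conf_text: str,
    required: Sequence[str] = ("193.111.200.91", "193.111.200.191"),
) -> bool:
    """True if every required resolver address appears in the configuration."""
    found = nameservers(resolv_conf_text)
    return all(ip in found for ip in required)


# Hypothetical file contents, for illustration only.
sample = """\
# example resolver configuration
nameserver 193.111.200.91
nameserver 193.111.200.191
"""

both_present = has_both_resolvers(sample)
only_one = has_both_resolvers("nameserver 193.111.200.91\n")
```

With the sample text above, `both_present` is `True` and `only_one` is `False`. A machine in the second state would have had no working resolver during the outage.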

We have corrected the problem and reported this fault back to the engineers who set up these machines. We apologise for any inconvenience caused.

RESOLVED: ESX host problems

We are in the process of rebooting one of our ESX machines, ‘Mars’. This will affect several services, including our public DNS resolvers.

VoIP customers may experience DNS-related issues if phones are configured to use 193.111.200.91 and 193.111.200.191 as their DNS servers. DSL customers may experience an outage where web sites cannot be reached.

Our server admin team are working on this now and will try to keep disruption to a minimum. Our apologies for any problems experienced because of this.

Update 12:51: Our DNS resolvers are now back online and working correctly. This issue will have resulted in problems setting up calls. We are currently bringing back other services, such as control panels, and are working as quickly as possible to restore any affected services.

Update 13:16: Control panel access has been restored.

RESOLVED - Zone file duplicates

Some users are seeing duplicates of the same zone file within the hosting control panel.

If you are affected by this, we would ask you not to make any changes until we have resolved this, as any change made may corrupt the zone file. We will post an update here as soon as possible.

We apologise for any problems this causes you.

****Update 10:26****

This issue is now fully cleared. If you are seeing any oddities or are missing any zone files, please raise an incident with support and we will investigate and restore them.