Gradwell Internet for business people

Gradwell Service News

Monthly Archive September, 2009

RESOLVED: DSL Exchange Outage

We have been informed by our carrier that the following exchanges are experiencing a power-related issue and a loss of backhaul. As a result, customers connected to these exchanges will experience a total loss of internet services.

Geographic location of affected services:

Porthcawl

Thatcham

Newbury

Alton

Basingstoke

Tadley

Caversham

Engineers and power contractors are onsite and have identified an issue which they are now working to resolve.

Further updates will be posted once received from our carrier.

UPDATE 12:16 - All affected sites are still without connectivity. Our carrier and its contractors expect the next update in an hour's time.

UPDATE 13:42 - Connectivity has been restored to all affected sites. However, it may be interrupted again, as engineering works to correct the original fault are ongoing.

UPDATE 16:59 - This issue has now been resolved and connectivity to all affected sites has been fully restored.

RESOLVED: Core Network Issues

ADDED: 08:30 24th Sept

We have been experiencing intermittent periods of poor network performance over the last 24 hours.

This is being caused by a fault that has developed on our main Cisco switch stack, in which its CPU runs at 100%, typically in intermittent bursts of 4-5 minutes.

We have escalated this issue to the Cisco Assistance Centre and we are anticipating their diagnosis this morning. We will then be able to move forward.

If hardware replacement is necessary, we can complete this within 4 hours, likely at the end of Thursday’s business day.

Please accept our apologies for this fault and be assured that it is being pursued.

UPDATED: 09:30 24th Sept

Our network management suppliers have escalated our Cisco case to “P1” status and are currently waiting on Cisco's response.

We have also lowered the CPU load on our core switches, which has significantly reduced the packet loss on the network, and we are currently taking steps to isolate our voice network onto separate routing equipment.

UPDATED: 10:45 24th Sept

Our network is fairly stable, but we are still seeing some issues with call quality and broadband speed.

Cisco are currently working on a resource exhaustion theory, whereby the reload of our switch last Wednesday triggered a bug that has manifested itself slowly over the week.  We are currently probing their diagnosis, but a further switch outage of approximately 10 minutes may be necessary later today in order to correct the problem.

UPDATED: 13:00 24th Sept

Following expert advice from Cisco and our network management consultants, we have identified that the problem is due to memory allocation (TCAM profile) on the Cisco switches we use.

This meant that if our network was being scanned (for example by a virus or potential intruder), part of the processor ran out of memory, which drove CPU load very high and degraded the switches' performance.

We have disabled a number of features on these switches, which has made more memory available, and we will monitor switch performance this afternoon, as it currently seems stable.

It may be necessary to reboot the switch, but we can now perform this out of hours.

We will update this status message again at 18:00 unless a further problem develops.

Update 18:00: We believe that this issue is now fully resolved after configuration changes were made around 13:00 this afternoon, and we are happy that the core network is now stable.

Update 20:20: We are currently still experiencing problems with our core network and are awaiting urgent information from our network engineers.

Update 20:40: Our network team are restoring our switch configuration to the known working configuration prior to last Wednesday’s scheduled maintenance and will be performing an emergency reboot of the switch shortly, which will result in a brief outage.

Update 21:25: We have seen a decrease in latency over our core network since 20:45; however, some packet loss is continuing.  Our network team will restart our switches with a reverted configuration within the next few minutes.

Update 21:45: Our network team have now restarted our core switches with an older configuration.  All services should be working normally and our internal monitoring is reporting success for all services.

RESOLVED: Customer crons

Many customer crontabs have not been run between 11:00 and 18:00 today due to emergency configuration changes which were intended to control load on our main shell server.  We have now restored this functionality and the crontab command will give details of the server running the crontabs.  Please note that customers may need to manually run any jobs which were scheduled to be started within these hours, if they are important.  Apologies for any inconvenience.
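For customers deciding which jobs to re-run by hand, the following is a minimal Python sketch that lists the start times a crontab entry would have had inside the affected window. It is an illustration only, not a tool we provide: the function names `expand_field` and `missed_runs` are hypothetical, it reads only the minute and hour fields of a standard five-field entry (day, month and weekday fields are assumed to match today), it does not handle shortcuts such as `@daily`, and it treats the window as 11:00 up to but not including 18:00.

```python
def expand_field(field, lo, hi):
    """Expand one crontab time field into the set of values it matches.

    Handles '*', single numbers, ranges (a-b), lists (a,b) and steps (/n).
    """
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        values.update(range(start, end + 1, step))
    return values


def missed_runs(entry, window_start=11 * 60, window_end=18 * 60):
    """Return the HH:MM start times of a crontab entry inside the window.

    The window is given in minutes after midnight and is half-open:
    window_start is included, window_end is not (an assumption).
    """
    minute_field, hour_field = entry.split()[:2]
    minutes = expand_field(minute_field, 0, 59)
    hours = expand_field(hour_field, 0, 23)
    return [
        f"{h:02d}:{m:02d}"
        for h in sorted(hours)
        for m in sorted(minutes)
        if window_start <= h * 60 + m < window_end
    ]


if __name__ == "__main__":
    # Hypothetical crontab entry: runs at half past twelve every day.
    print(missed_runs("30 12 * * * /home/user/report.sh"))
```

An empty list means the entry had no start time in the window, so nothing needs to be re-run for it.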

RESOLVED: IAX Trunks

10:36 - We are currently experiencing an issue which is affecting both inbound and outbound calls over our IAX platform.

Our engineering teams have been alerted to this issue and are currently investigating the cause of the problem.

Further updates will be posted once information has been made available.

UPDATE 10:47 - Engineers have now restored service to the IAX Platform, calls are successfully passing through the load balanced infrastructure.

UPDATE 11:23 - We have become aware that customers may be experiencing audio issues on the IAX platform, including voice loss. Our engineers are investigating the cause of this issue.

Update Sep 24th 18:00: This issue has now been closed. The IAX problems initially reported yesterday were related to the core network issues.  We have received reports from customers that IAX call quality has recovered.

RESOLVED: Zone file duplicates

Some users are seeing duplicates of the same zone file within the hosting control panel.

If you are affected by this, please do not make any changes until we have resolved it, as any change may corrupt the zone file. We will post an update here as soon as possible.

We apologise for any problems this causes you.

****Update 10:26****

This issue is now totally cleared. If you are seeing any oddities or missing any zone files, please raise an incident with support and we will investigate and restore.

RESOLVED: Outbound call problems

Some of our customers may be seeing problems making outbound calls. Our server team are working on this now and we will have this resolved as soon as possible. We will post all updates here as they are available.

Our apologies for any problems this is causing.

****Update - 09:17****

Calls are now passing outbound. Once we have tested fully, we will update further.

****Update - 09:38****

We have now tested fully and calls are passing outbound as expected. Our server administrators have restarted the opensips proxy that handles the outbound calls and will be making further configuration changes overnight to ensure this problem cannot happen again. If you are still seeing problems making outbound calls, please raise a support incident with support@gradwell.com.

Again our apologies for any problems caused by this.

COMPLETED: Mailman datacentre migration

We will shortly be migrating our Mailman ‘list.yourdomain.com’ interface to a server in a different datacentre, and we need to change the IP address of the system from 193.84.87.101 to 212.11.71.211.  We will be modifying all hosted domains automatically, and the change should take effect in approximately an hour.  Due to the nature of list archives and mail queues, we will not be able to operate both servers during the switchover, so customers may experience some minimal downtime with this service.

Update 17:11 - We have resolved an issue with the new server configuration which was preventing the migration.  The old Mailman server has now been disabled, and the web interface removed to ensure nothing changes before our final migration to the new server.

Update 19:47 - Apologies for the delay with this new server, which had to copy the remaining mail over from the old server without any changes being accepted. The new server is now active on 212.11.71.211, and all hosted DNS has been updated and is now propagating over the internet.  This maintenance has been essential because the IP address space available to us in Sovereign House is being re-allocated.

Our senior engineers are currently investigating an issue with our Mailman implementation which is currently causing delivery issues, although the web interface is now working correctly.

Update 22:30 - Our Mailman migration is now complete. Our engineers have found and corrected the issue with mail not being delivered, and the server has been processing queued mail correctly for the past 30 minutes.
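While the DNS change propagates, a list hostname may still resolve to the old address for a time. The following is a minimal Python sketch of how that could be checked, using the two addresses given in this notice. The function name `mailman_dns_status` and its return labels are hypothetical, `list.yourdomain.com` is a placeholder for your own list hostname, and the result reflects only what your local resolver currently returns.

```python
import socket

OLD_IP = "193.84.87.101"  # address of the old Mailman server (from this notice)
NEW_IP = "212.11.71.211"  # address of the new Mailman server (from this notice)


def mailman_dns_status(hostname, resolve=socket.gethostbyname):
    """Classify where a list hostname currently points.

    `resolve` defaults to the system resolver; it is a parameter so the
    check can be exercised without network access.
    """
    try:
        address = resolve(hostname)
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError.
        return "unresolved"
    if address == NEW_IP:
        return "migrated"
    if address == OLD_IP:
        return "stale"
    return "other"


if __name__ == "__main__":
    # Placeholder hostname: substitute your own list hostname.
    print(mailman_dns_status("list.yourdomain.com"))
```

A result of "stale" suggests a cached record that has not yet expired, while "other" suggests the hostname is managed outside our hosted DNS and may need updating by hand.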

RESOLVED: DSL Ordering and Portal

10:33 - Due to a systems failure at our carrier, the DSL Ordering system is currently unavailable. This also affects operations that customers can perform in their portal, such as line profile changes.

Please note, this is not affecting customers' ability to connect to our DSL platform.

Further updates will be posted as they are received from our carrier.

This issue has now been resolved by our carrier and all systems are now working correctly.

RESOLVED: Intermittent outbound call failures

We are seeing some issues with outbound calls failing. Our server team and switch engineers are working on this now and will get this resolved as soon as possible.

We will post all updates here as they are available. Apologies for any problems this is causing.

***UPDATE 15:46***

All outbound calls are connecting but there may be a slight delay in connection whilst the call is routed.

***UPDATE 16:32***
This issue was related to capacity at our PSTN interconnects. Outbound calls are now routing correctly and we are not seeing any outbound call failures. We are introducing more capacity tonight to ensure this issue does not happen again.

COMPLETED: Web and database maintenance

We will be conducting overnight maintenance in our London datacentre this evening which will involve moving our database hosts and removing our legacy FreeBSD web cluster servers.  The maintenance will commence at 22:00 and is expected to continue until 06:00.  During this time, some services may not be available for short periods.  We will be decommissioning the legacy servers - Lilac and Ochre, SIP and shell services respectively, and mysql5db-1 will be temporarily unavailable, as it will be physically re-located within the datacentre.  Our core hosting database will also be moved, which will affect most services, including the ability to log into mailboxes.  We aim to keep this downtime to a minimum.

Update 02:45: Please note that we are now performing database maintenance, which will prevent VoIP calls from completing and mail from being collected.  Websites will continue to function; however, configuration changes cannot be completed at this site and control panels will be temporarily inaccessible.

Update 02:55: We have moved our core hosting database to a temporary location to minimize any disruption.  All services should now be available again.

Update 03:23 - We have taken down mysql5db-1 (customer MySQL server) and one core hosting database slave. Inbound and outbound calls and mail are 100% operational; some customer databases will be unavailable for a short duration. Web hosting should also be fine.

Update: This maintenance was completed by 08:00.