We are currently experiencing an intermittent issue with one of our back-end storage servers, which handles some of our internal scheduled tasks. This server is being checked at the moment, and we have been alerted to the possibility that it is causing legacy FreeBSD-hosted sites to respond slowly. We are working to bring this server back up as quickly as possible and move the data to an alternative server.
Update 21:48: We have brought this server back online and are continuing to migrate all data from it, as this server remains at risk. Our internal monitoring systems show that lon-web-1 and lon-web-2 have now recovered.
Update: June 30th 23:00: We have now migrated enough data from our legacy system ‘Azure’ to mark our new HP/Lefthand copy active. We do not expect any further outages with Azure to prevent back-end scripts from running, and we therefore consider this issue closed.
We are currently experiencing an outage of one of our core storage servers, Europa, which will temporarily affect access to part of our web-services and some other services. We will publish more information as soon as it is available.
Update 16:25: Our core storage server has now come back online. However, some of the dependent services, such as webmail, some clustered outbound SMTP servers and web servers, have file system errors and are being restarted.
Individual customer mailboxes or sites are not affected due to our migration to HP/Lefthand storage.
RESOLVED by 16:51: We have now restarted any affected machines. Core VoIP services were not affected, although we have had to restart both of our inbound Asterisk-based fax servers; the majority of customers are not using these. Our web load balancer was briefly unavailable, which will have caused a brief outage for customer-hosted web-sites. Our monitoring system is continuing to report some machines which have had file system errors and we are intervening as appropriate. However, the majority of these are clustered services, so customers should not see any effects.
We have restarted our VoIP and hosting control panels due to a file-system problem.
Apologies for any inconvenience this might have caused.
We will be performing essential maintenance on our virtualisation platform this evening at 22:00, which will involve home directories temporarily being unavailable for customers on Clearwater and Dixie. This will also mean that web-sites hosted on these servers are temporarily unavailable.
Additionally, mail will be temporarily unavailable for customers with mailboxes on the mail servers ‘Badlands’, ‘Yosemite’, ‘Denali’ and ‘Yellowstone’. We will be working on these servers individually in order to keep downtime to a minimum.
This maintenance is now in progress.
Update 22:37 - We have now migrated the mail servers and work on the home file stores has started.
Update 23:14 - We have now finished work on customer data stores and are looking at some remaining issues with mail servers.
Update 01:00 - All services are currently live. Due to unexpected networking issues, the mail file servers have not been migrated at this time and will continue to be active on the old installation.
Engineers are currently investigating an outage on one of our Telehouse disk shelves. Some services may be unavailable until further notice. Please accept our apologies for any inconvenience. We will update this report when we have further information.
Update - 20:00 - We are seeing some issues with the data that feeds the web clusters. Some visitors will be greeted with 403 error pages instead of content at present. Our server team are still working hard on this and will implement a fix as soon as possible.
Update - 21:00 - This issue has now been resolved. Our systems admins will continue to monitor it throughout the evening. The 403 forbidden errors customers may have seen were due to an incorrect networking configuration on one of our new file storage servers. The internal issue reference for this is 1831. We apologise for any inconvenience caused by this problem, and we will be investigating how best this type of problem can be avoided in the future.
Due to a small number of customers keeping database connections open for a long time, we have implemented a new policy to disconnect any connection to the customer database servers that has been open for longer than 15 minutes. Some code may not handle this correctly, so we would encourage customers to check any scripts they run for backups or log/data processing which take a long time.
Code which assumes the database connection will stay open indefinitely during the process may need to be altered by customers.
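As an illustration only, the following is a minimal sketch of one way a long-running script could cope with the 15-minute limit: it reopens its connection once the connection reaches a chosen age, rather than holding one connection for the whole job. The class name `ReconnectingConnection`, the 14-minute default and the `connect` factory are our own hypothetical choices, and `sqlite3` is used purely as a stand-in so the example runs anywhere; your script would pass in whatever connect call its own database driver provides.

```python
import sqlite3
import time


class ReconnectingConnection:
    """Hands out a connection, reopening it once it is older than max_age_seconds.

    Hypothetical helper for illustration; not part of any database driver.
    """

    def __init__(self, connect, max_age_seconds=14 * 60, clock=time.monotonic):
        self._connect = connect          # zero-argument callable returning a new connection
        self._max_age = max_age_seconds  # assumed margin below the 15-minute server limit
        self._clock = clock              # injectable clock, which makes testing easier
        self._conn = None
        self._opened_at = None

    def get(self):
        """Return a connection that is younger than max_age_seconds."""
        now = self._clock()
        if self._conn is None or now - self._opened_at >= self._max_age:
            self.close()
            self._conn = self._connect()
            self._opened_at = now
        return self._conn

    def close(self):
        """Close the current connection, ignoring errors from an already-dropped link."""
        if self._conn is not None:
            try:
                self._conn.close()
            except Exception:
                pass
            self._conn = None


if __name__ == "__main__":
    # Stand-in factory: replace with your own driver's connect call.
    db = ReconnectingConnection(lambda: sqlite3.connect(":memory:"))
    for batch in range(3):
        conn = db.get()  # fetch the connection at the start of each batch of work
        conn.execute("SELECT 1")
    db.close()
```

The idea is to call `get()` at the start of each batch of work instead of storing a connection in a variable for the duration of the script. A single query or transaction that itself runs for longer than the limit would still be cut off, so very long operations may also need to be split into smaller pieces.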
We have made the following changes to the Gradwell control panels this afternoon:
- All of our control panels are now hosted on *.gradwell.com.
- Our control panels share a new look & feel and Gradwell’s new logo.
These changes are the first step in the launch of Gradwell’s refreshed brand. There will be more changes from us in the coming days.
If you spot any problems with any of our control panels, please contact Customer Services to let us know.
We have identified a critical issue with our PHP 5.2 load balancer which allows connections to remain open indefinitely. One of our customers is currently making excessive use of this, in what is effectively a denial of service attack. Our server team are creating a workaround for this problem and the fix will be rolled out as an emergency change shortly. During this time, customers may see intermittent slow web responses from web-sites hosted on the PHP 5.2 platform.
Update 15:15: Apologies for the delay in updating this notice. This issue was resolved at 10:00.
We are currently investigating a configuration problem with our legacy FreeBSD cluster affecting customers stored on our /home6 mount-point. We should be able to resolve this shortly and apologise for any inconvenience caused. This issue started at 17:00.
Update 19:00: Our server team have brought the majority of affected sites back online and will be continuing to look into the issue until we are satisfied it is resolved. We have identified a problem with some of our customer site provisioning logic and are working to improve this and make it more robust.
Update 10:00: An issue has been brought to our attention on the PHP 5.2 cluster, involving some servers which had provisioned sites incorrectly. This has now been resolved by our server team.
Our server team have restarted a critical legacy customer storage server which was not responding.
Customers will have seen unresponsiveness across several dependent services, such as web sites and shell access, as a result. We apologise for this and are continuing to migrate away from this server and on to our new HP/Lefthand platform.