Gradwell Internet for business people

Gradwell Service News

RESOLVED: Inbound Call Issues

We are currently investigating an issue on our inbound SIP routers which is stopping inbound calls to our network. We are treating this as our highest priority and we apologise for any inconvenience caused.

UPDATE: This issue has now been cleared by our system admins.

RESOLVED: VoIP Issues

We are currently seeing issues with two of our PBX servers, lon-pbx-5 and lon-pbx-2. Our system admins are looking into these problems now.

We apologise for any inconvenience caused.

Update: This issue has now been cleared by our system admins and we will continue to monitor the situation.

New Brand Launch

On Tuesday 26th May 2009, we will be launching our new branding across all of our websites.  As part of the brand launch, we’re moving our websites to gradwell.com, and we’re introducing an updated look to both our main website and our control panels featuring our new ‘g’ logo.

We hope you’ll join us on Tuesday to celebrate our new brand launch!

Update Tuesday 26th May @ 9:00am: We’re delaying the launch of our new website until w/c 1st June. It is looking great, but we’re resisting the temptation to put it live before testing is completed. Keep an eye out for more details next week.

RESOLVED: Support services

We wish to advise customers that, due to a flu virus, our support team is running at reduced capacity and customers may experience long delays for telephone support. We recommend submitting urgent issues by email to support@gradwell.net, and we will address them as soon as possible. We expect to be back to normal capacity on Monday.

An Update On Service Disruption - 24th Jan 2009

I want to let our customers know about our progress in dealing with the service disruption that they have experienced since Monday. This disruption has been caused by issues between VMware ESX and our iSCSI storage, and we’ve taken the following steps to reduce and stop it.

  • We have upgraded the RAM and firmware inside our storage unit called ‘iSCSI-2’, and it is now back in service. We are currently moving some services across from iSCSI-3, and once we have reduced the load on iSCSI-3 enough, there should be no more disruption. (The root cause of the disruption is that iSCSI-3 was being asked to do too much, because we lost iSCSI-2 for a while.)
  • We have resolved the other configuration issues with our ESX cluster.
  • We were unable to install additional storage on Wednesday.  Unfortunately, the hardware that we were reprovisioning didn’t work reliably enough during our testing; we’ve had to abandon this plan for now.
  • Due to stock shortages, it will be Wednesday 28th Jan before we take delivery of the 4 x HP DL160 servers we’ve ordered to use as further additional storage. We’re aiming to install these a week on Sunday (Sun 1st Feb).

We will keep you informed as we make more progress with building and installing the additional storage.

We are conducting a lessons-learned exercise early next week, and we will publish the results here once it is complete.

An Update On Service Disruption - Weds 21st Jan

From approximately 1:20pm on Monday 19th until midnight on Tuesday 20th, there was ongoing disruption to many of our services, including email, web hosting, broadband, our portal, some PBXs, and NewSIP. We apologise for this disruption, and want to assure our customers that we are taking steps both to resolve it and to prevent future occurrences.

As of Wednesday 21st at noon, we believe that we have a stable platform delivering services to customers, although we are continuing work to fully resolve all the outstanding issues.

Most of our services are delivered through approximately 300 virtual servers, which run the applications and software that deliver service to customers. These sit inside an environment managed by VMware, an enterprise datacentre management product, which at present controls 15 very high spec physical servers (delivering CPU capacity equivalent to 70 ordinary servers). The storage for the servers and files is provided by three iSCSI disk array units, each having multiple connections to the servers.
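
To give a feel for the scale described above, the following minimal Python sketch works through the arithmetic. The even spread of virtual servers across hosts and disk arrays is an assumption made purely for illustration; the real distribution is not stated here.

```python
# Figures quoted in the description above.
VIRTUAL_SERVERS = 300     # "approximately 300 virtual servers"
PHYSICAL_HOSTS = 15       # physical servers in the cluster
EQUIVALENT_SERVERS = 70   # CPU equivalent quoted above
STORAGE_ARRAYS = 3        # iSCSI disk array units


def per_unit(total, units):
    """Average share per unit, assuming an even spread (illustrative assumption)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total / units


vms_per_host = per_unit(VIRTUAL_SERVERS, PHYSICAL_HOSTS)
equivalent_per_host = per_unit(EQUIVALENT_SERVERS, PHYSICAL_HOSTS)
vms_per_array = per_unit(VIRTUAL_SERVERS, STORAGE_ARRAYS)
# The same virtual servers squeezed onto one fewer array.
vms_per_array_one_down = per_unit(VIRTUAL_SERVERS, STORAGE_ARRAYS - 1)

print(f"Virtual servers per host:           {vms_per_host:.0f}")
print(f"Equivalent ordinary servers / host: {equivalent_per_host:.1f}")
print(f"Virtual servers per array:          {vms_per_array:.0f}")
print(f"Per array with one array offline:   {vms_per_array_one_down:.0f}")
```

Under that even-spread assumption, taking one of the three arrays out of service raises the average load on each remaining array by half.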

On the afternoons of the 19th and 20th we experienced a problem whereby one of the connections to our storage became overloaded. This caused several of the 15 host servers to become overloaded as they tried to re-route the disk traffic. VMware attempted to manage this and powered up the virtual machines on other hosts, but this caused further disk overload and exacerbated the problem.

We have identified the causes of these problems, and are currently working to resolve them.

  1. A number of our VMware ESX servers are currently trying to connect to virtual storage that does not exist. We are working with VMware to discover why and to get this resolved. Update 21st Jan 09: this has been resolved, and was traced to a reporting error in the ESX management tools.
  2. As a result of our maintenance work to upgrade the processors on our VMware cluster on Sunday 18th, the network traffic between our VMware ESX servers and our iSCSI-3 SAN has become unbalanced, sending too much traffic down one route instead of balancing it across the multiple routes available. We are currently rebalancing the network routes, which we hope will resolve this problem, and are working with VMware to investigate whether this is related to problem 3. Update 21st Jan 09: the network routes have been rebalanced.
  3. When our VMware ESX hosts are restarted because of problems 1 & 2, they are losing connectivity to one of our disk arrays, “iSCSI-2”. This is causing capacity problems on the remaining ESX hosts that can still connect to iSCSI-2. We are working with VMware to discover why and to get this resolved, and at the time of writing we have migrated nearly all of the services off this storage array.

We will update this list if we discover any additional underlying problems behind this disruption.
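
The traffic imbalance described in problem 2 can be pictured with a toy model. The Python sketch below is only an illustration and does not represent VMware's multipathing implementation; the path names, the request count, and the two policy names are all hypothetical. It compares sending every storage request down a single route with rotating requests across all available routes.

```python
# Hypothetical route names between the hosts and one storage array.
PATHS = ["path-1", "path-2", "path-3", "path-4"]


def spread_requests(num_requests, paths, policy):
    """Count how many requests each path carries under a given policy.

    "single_path" models the unbalanced case (everything on the first path);
    "round_robin" models the balanced case (requests rotate across paths).
    """
    if not paths:
        raise ValueError("at least one path is required")
    load = {path: 0 for path in paths}
    if policy == "single_path":
        load[paths[0]] = num_requests
    elif policy == "round_robin":
        for i in range(num_requests):
            load[paths[i % len(paths)]] += 1
    else:
        raise ValueError(f"unknown policy: {policy}")
    return load


def busiest_share(load):
    """Fraction of all requests carried by the most heavily used path."""
    total = sum(load.values())
    return max(load.values()) / total if total else 0.0


for policy in ("single_path", "round_robin"):
    load = spread_requests(1000, PATHS, policy)
    print(policy, load, f"busiest path carries {busiest_share(load):.0%}")
```

In this model the busiest route carries all of the traffic in the unbalanced case and a quarter of it once requests are rotated across four routes.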

To prevent these problems happening in future, we need to spread our storage across a larger number of storage units, and we also need to have dedicated storage for each of our headline services. We are installing additional storage units on the evening of Wednesday 21st, and once that’s done we’ll be scheduling regular out-of-hours maintenance to rebalance our services across the new storage.
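
One simple way to think about rebalancing services across more storage units is a greedy placement: take the heaviest service first and put it on whichever unit currently carries the least load. The Python sketch below illustrates that idea only. It is not a description of the actual migration plan, and the load figures and unit names are hypothetical.

```python
import heapq

# Hypothetical relative storage loads per service (not measured figures).
SERVICE_LOADS = {
    "email": 40,
    "web hosting": 30,
    "PBX": 25,
    "NewSIP": 20,
    "broadband": 10,
    "portal": 5,
}


def spread_services(service_loads, unit_names):
    """Greedily place each service, heaviest first, on the least-loaded unit."""
    heap = [(0, name) for name in unit_names]  # (current load, unit name)
    heapq.heapify(heap)
    placement = {name: [] for name in unit_names}
    by_load = sorted(service_loads.items(), key=lambda item: item[1], reverse=True)
    for service, load in by_load:
        current, name = heapq.heappop(heap)    # least-loaded unit so far
        placement[name].append(service)
        heapq.heappush(heap, (current + load, name))
    totals = {
        name: sum(service_loads[s] for s in services)
        for name, services in placement.items()
    }
    return placement, totals


# Compare two storage units with four (unit names are hypothetical).
for units in (["unit-a", "unit-b"], ["unit-a", "unit-b", "unit-c", "unit-d"]):
    placement, totals = spread_services(SERVICE_LOADS, units)
    print(len(units), "units:", totals)
```

With these made-up figures, the heaviest unit carries 65 units of load when there are two storage units and 40 when there are four, and the largest service ends up with a unit to itself.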

As of Wednesday lunchtime (21st), we believe we have returned the load balancing on our storage to the stable settings we had on Friday, and we have migrated nearly all services off our iSCSI-2 disk array.

We also believe that one of the root causes of these and other recent problems is that our VMware cluster has grown too rapidly and its configuration needs optimising. We began a project in January 09 with some independent expertise, and it is coming to fruition: the audit work is complete, and plans for the longer-term reconfiguration of our platform are being developed so that we can continue to scale our systems to match the rapid growth of our clients.

Finally, we know that equipment needs to be matched with people and skills. In addition to our January project, we have recently completed hires in our systems administration team, and we are pushing our financiers to make further commitments for 2009 against our business plan so that we can be proactive in delivering further capacity later in the year.

We’d like to take this opportunity to thank all our customers for their help and support during this time.

Kind Regards

Stuart Herbert, Technical Manager & Peter Gradwell, Managing Director

Gradwell Opening Hours for Friday 12th December 2008

We wish to advise customers that on Friday 12th December, from 2:30pm, our offices will be closed and we will operate with minimal support staff during our annual Christmas get-together.

Our messaging service will continue to relay messages to our duty support technicians, and we will naturally respond to all service-affecting issues.

Our normal hours of operation will resume on Saturday 13th December, when technical support will be available from 9am.