Gradwell Internet for business people

An Update On Service Disruption - Weds 21st Jan

From approximately 1:20pm on Monday 19th until midnight on Tuesday 20th, there was ongoing disruption to many of our services, including email, web hosting, broadband, our portal, some PBXs, and NewSIP.  We apologise for this disruption, and want to assure our customers that we are taking steps both to resolve it and to prevent future occurrences.

As of Wednesday 21st at noon, we believe that we have a stable platform delivering services to customers, although we are continuing work to fully resolve all the outstanding issues.

Most of our services are delivered through approximately 300 virtual servers, which run the applications and software that deliver service to customers. These sit inside an environment managed by VMWare, an enterprise datacentre management product, which at present controls 15 very high-spec physical servers (delivering the equivalent CPU capacity of 70 real servers). The storage for the servers and files is provided by three iSCSI disk array units, each with multiple connections to the servers.

On the afternoons of the 19th and 20th we experienced a problem whereby one of the connections to our storage became overloaded. This caused several of the 15 host servers to become overloaded as they tried to re-route the disk traffic. VMWare attempted to manage this by powering up the affected virtual machines on other hosts, but this caused further disk overload and exacerbated the problem.
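
As a rough illustration only, the Python sketch below shows how restarting virtual machines on other hosts can turn one overloaded host into a wider failure. The host names, load figures, and the `simulate_failover` function are hypothetical; this is a toy model, not a description of how VMWare works internally or of measurements from our platform.

```python
def simulate_failover(loads, capacity, restart_overhead):
    """Return the order in which hosts fail in a toy overload cascade.

    loads: dict of host name -> disk I/O load (arbitrary, invented units)
    capacity: load a single host can sustain
    restart_overhead: extra load created by powering VMs up elsewhere
    """
    loads = dict(loads)  # work on a copy so the caller's dict is untouched
    failed = []
    while loads:
        # Look at the most heavily loaded host that is still running.
        worst = max(loads, key=loads.get)
        if loads[worst] <= capacity:
            break  # every remaining host is within capacity
        displaced = loads.pop(worst) + restart_overhead
        failed.append(worst)
        if not loads:
            break  # nothing left to take the displaced load
        # The displaced VMs restart on the survivors, spread evenly.
        share = displaced / len(loads)
        for host in loads:
            loads[host] += share
    return failed


# Hypothetical figures: one overloaded host, two lightly loaded ones.
hosts = {"esx-a": 120, "esx-b": 20, "esx-c": 20}
print(simulate_failover(hosts, capacity=100, restart_overhead=0))
print(simulate_failover(hosts, capacity=100, restart_overhead=50))
```

In this toy model the first call loses only the overloaded host, while the second, where restarting the machines adds load of its own, loses all three.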

We have identified the following causes of these problems, and are currently working to resolve them.

  1. A number of our VMWare ESX servers are currently trying to connect to virtual storage that does not exist.  We are working with VMWare to discover why and to get this resolved.  Update 21st Jan 09: this has been resolved, and was traced to a reporting error in the ESX management tools.
  2. As a result of our maintenance work to upgrade the processors on our VMWare cluster on Sunday 18th, the network traffic between our VMWare ESX servers and our iSCSI-3 SAN has become unbalanced, sending too much traffic down one route instead of spreading it across the multiple routes available.  We are currently rebalancing the network routes, which we hope will resolve this problem, and are working with VMWare to investigate whether this is related to problem 3.  Update 21st Jan 09: the network routes have been rebalanced.
  3. When our VMWare ESX hosts are restarted because of problems 1 & 2, they are losing connectivity to one of our disk arrays, "iSCSI-2".  This is causing capacity problems on the remaining ESX hosts that can still connect to iSCSI-2.  We are working with VMWare to discover why and to get this resolved, and we have, at time of writing, migrated nearly all of the services off this storage array.
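
To illustrate the imbalance described in problem 2, here is a minimal Python sketch comparing hosts that all use one storage route with hosts spread round-robin across several. The 15 hosts match the figure given above; the three route names and the `assign_paths` and `path_load` functions are assumptions made for the example and are not our real configuration or a VMWare API.

```python
from collections import Counter
from itertools import cycle


def assign_paths(hosts, paths, balanced=True):
    """Map each host to one storage route.

    balanced=True rotates through every route in turn (round-robin);
    balanced=False sends every host down the first route only.
    """
    if balanced:
        rotation = cycle(paths)
        return {host: next(rotation) for host in hosts}
    return {host: paths[0] for host in hosts}


def path_load(assignment):
    """Count how many hosts are using each route."""
    return Counter(assignment.values())


# 15 hosts as in the article; the three route names are hypothetical.
hosts = [f"esx{i}" for i in range(1, 16)]
paths = ["route-a", "route-b", "route-c"]

print(path_load(assign_paths(hosts, paths, balanced=False)))
print(path_load(assign_paths(hosts, paths, balanced=True)))
```

With every host on one route, that route carries the traffic of all 15 hosts; spread round-robin, each of the three assumed routes carries five.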

We will update this list if we discover any additional underlying problems behind this disruption.

To prevent these problems happening in future, we need to spread our storage across a larger number of SANs, and we also need to have dedicated storage for each of our headline services.  We are installing additional storage units on the evening of Wednesday 21st, and once that’s done we’ll be scheduling regular out-of-hours maintenance to rebalance our services across the new storage.

As of lunchtime on Wednesday 21st, we believe we have returned the load balancing on our storage to the stable settings we had on Friday, and we have migrated nearly all services off our iSCSI-2 disk array.

We also believe that one of the root causes of these and other recent problems is that our VMWare cluster has grown too rapidly and its configuration needs optimising. We began a project in January 09 with some independent expertise, and it is now coming to fruition: the audit work is complete and plans for the longer-term reconfiguration of our platform are being developed, so that we can continue to scale our systems to match the rapid growth of our clients.

Finally, we know that equipment needs to be matched with people and skills. In addition to our January project, we have recently completed hires in our systems administration team, and we are pushing our financiers to make further commitments for 2009 against our business plan so that we can be proactive in delivering further capacity later in the year.

We’d like to take this opportunity to thank all our customers for their help and support during this time.

Kind Regards

Stuart Herbert, Technical Manager & Peter Gradwell, Managing Director
