ADDED: 08:30 24th Sept
We have been experiencing intermittent periods of poor network performance over the last 24 hours.
This is being caused by a fault that has developed on our main Cisco switch stack, in which its CPU is running at 100%, typically in intermittent bursts of 4-5 minutes.
We have escalated this issue to the Cisco Assistance Centre and we are anticipating their diagnosis this morning. We will then be able to move forward.
If hardware replacement is necessary, we can complete this within 4 hours, likely at the end of Thursday’s business day.
Please accept our apologies for this fault and be assured that it is being pursued.
UPDATED: 09:30 24th Sept
Our network management suppliers have escalated our Cisco case to “P1” status, and are currently waiting on their response.
We have also lowered the CPU load on our core switches, which has significantly reduced the packet loss on the network, and we are currently taking steps to isolate our voice network onto separate routing equipment.
UPDATED: 10:45 24th Sept
Our network is fairly stable, but we are still seeing some issues with call quality and broadband speed.
Cisco are currently working on a resource exhaustion theory, whereby the reload of our switch last Wednesday triggered a bug that has manifested itself slowly over the week. We are currently reviewing their diagnosis, but there is a possibility that a further switch outage of approximately 10 minutes may be necessary later today in order to correct the problem.
UPDATED: 13:00 24th Sept
Following expert advice from Cisco and our network management consultants, we have identified that the problem is due to memory allocation (TCAM profile) on the Cisco switches we use.
This meant that if our network was being scanned (for example by a virus or potential intruder), part of the processor ran out of memory and caused CPU load to become very high, causing problems with the switches’ performance.
We have disabled a number of features on these switches, which has made more memory available, and will monitor the switch performance this afternoon, as it currently seems stable.
It may be necessary to reboot the switch, but we can now perform this out of hours.
We will update this status message again at 18:00 unless a further problem develops.
Update 18:00: We believe that this issue is now fully resolved after configuration changes were made around 13:00 this afternoon, and we are happy that the core network is now stable.
Update 20:20: We are currently still experiencing problems with our core network and are awaiting urgent information from our network engineers.
Update 20:40: Our network team are restoring our switch configuration to the known working configuration prior to last Wednesday’s scheduled maintenance and will be performing an emergency reboot of the switch shortly, which will result in a brief outage.
Update 21:25: We have seen a decrease in latency over our core network since 20:45; however, some packet loss is continuing. Our network team will restart our switches with a reverted configuration within the next few minutes.
Update 21:45: Our network team have now restarted our core switches with an older configuration. All services should be working normally and our internal monitoring is reporting success for all services.