Here are the details regarding the service interruptions that occurred during the week of 11/28. In the interest of transparency, we feel obligated to explain what caused them. I have provided a brief synopsis first, followed by a more detailed account for those who want to know more. We understand your frustration with these outages, and I hope that as you read the details below it will become clear that they were caused by an unusual confluence of events, some of them beyond our control. I want to assure you that many LTS staff members worked extremely hard to resolve the problems. We apologize for the inconvenience, and we are developing procedures to protect against such outages in the future.
In short, the problems began on Monday morning (11/28) with the failure of a computer that authenticates our users. This resulted in slowness in Sakai and an outage of My Wellesley. The My Wellesley failure was a major one and took until 2 AM on Tuesday to recover, at which point the authentication issue was also resolved.
The Sakai problems continued, and on Tuesday afternoon (11/29) they were traced to a network problem with Global Crossing (a network provider) in Chicago, which acts as an intermediary between our network provider and Longsight (where Sakai runs). Global Crossing initially refused to acknowledge that there was a problem. On Wednesday (11/30), we asked Longsight to move Sakai to a different location to avoid going through Global Crossing.
If you are interested in the longer version, you can continue. If you have any questions or suggestions, please let me know.
From: Ravi Ravishanker, CIO, LTS, 860-631-RAVI
Monday 11/28 morning
One of the servers that authenticate our users, called the primary domain controller, failed. We have two other backup controllers active, but My Wellesley failed to switch over to them and was left in a hung state. Sakai is configured to try the primary first and go to the backups only after that attempt fails, so every login first waited on an attempt to reach the failed primary before falling back to a backup; this resulted in very slow logins. Longsight, the Sakai support vendor, reconfigured Sakai to go directly to the backup controllers. Things improved somewhat, but Longsight was still seeing sporadic network issues.
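For readers who like to see the mechanics, here is a minimal sketch of this kind of ordered failover. Everything in it is hypothetical: the hostnames, port, and timeout are made up, and the real Sakai stack delegates authentication to a directory service rather than opening raw sockets. It is only meant to show why putting a failed primary first in the list slows every single login.

    # Hypothetical illustration of ordered failover; not Sakai's or Longsight's actual code.
    # Each login tries the primary domain controller first; when the primary is down,
    # every login waits out the connection timeout before reaching a working backup.
    import socket

    CONTROLLERS = [
        ("dc-primary.example.edu", 636),  # the controller that failed (hypothetical name)
        ("dc-backup1.example.edu", 636),  # healthy backup (hypothetical name)
        ("dc-backup2.example.edu", 636),  # healthy backup (hypothetical name)
    ]
    TIMEOUT_SECONDS = 10  # illustrative; this wait is what users feel as a "very slow login"

    def connect_to_first_available(controllers):
        """Return a connection to the first controller that answers, in list order."""
        for host, port in controllers:
            try:
                return socket.create_connection((host, port), timeout=TIMEOUT_SECONDS)
            except OSError:
                continue  # reached only after waiting out the timeout on a dead host
        raise RuntimeError("no domain controller reachable")

    # The fix amounted to reordering the list so the failed primary is no longer
    # tried first, e.g. connect_to_first_available(CONTROLLERS[1:]).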
Monday 11/28 afternoon/night
LTS staff began the process of moving the primary domain controller to new hardware, which took several hours. Rebooting My Wellesley was not successful because of operating system corruption. After several hours of troubleshooting and restoring the server from previous backups, our staff determined that the server needed to be rebuilt. By 2 AM on Tuesday, both the primary domain controller and My Wellesley were up and fully functional. Because the Sakai slowness appeared strongly correlated with these campus issues, we ran a few checks of Sakai and concluded that everything was okay.
Tuesday 11/29 morning
Things “seemed” a bit more stable with Sakai. By mid-morning, we saw postings about Sakai failures in Google Groups and received several Help Desk calls. At 2 PM, we held a conference call with Longsight during which the sporadic network issue was traced to the connection between Global Crossing, an intermediate internet provider, and our network provider, Lightower. Unfortunately, this issue was something that only Longsight could see, because our traffic to them travels over a different path.
Longsight also informed us that, using additional tools to analyze historical data, they found that this problem may have begun as early as 11/21. With this information, it became clear that the Sakai issues we were seeing were NOT correlated with our domain controller failure. We immediately escalated the issue to Lightower, our network provider, who then discovered that many of their other customers were affected by the same problem.
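This report does not say which tools Longsight used. Purely as an illustration of the kind of historical data involved, a periodic reachability log along the following lines, kept over several weeks, is what makes it possible to say that a problem "may have begun as early as 11/21". The hostname, port, and interval below are hypothetical.

    # Hypothetical sketch of a periodic reachability logger; it is not the tool
    # Longsight actually used, which this report does not describe.
    import datetime
    import socket
    import time

    TARGET = ("sakai.example.edu", 443)  # hypothetical hostname and port
    INTERVAL_SECONDS = 60

    def probe(target, timeout=5.0):
        """Return the TCP connect time in seconds, or None if the target is unreachable."""
        start = time.monotonic()
        try:
            with socket.create_connection(target, timeout=timeout):
                return time.monotonic() - start
        except OSError:
            return None

    while True:
        rtt = probe(TARGET)
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        status = "UNREACHABLE" if rtt is None else f"{rtt * 1000:.1f} ms"
        # Weeks of lines like this are what let an operator date the onset of a problem.
        print(stamp, status)
        time.sleep(INTERVAL_SECONDS)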
Wednesday 11/30 morning
Unfortunately, Global Crossing is an intermediary, and neither Longsight nor we can contact them directly. It was frustrating to everyone that Global Crossing refused to accept that there was a problem. We asked Longsight to move our instance of Sakai to their Chicago Data Center so that our traffic would not pass through Global Crossing, which they did successfully around 8 AM. Longsight gave up on waiting for Global Crossing, and later in the day on Wednesday they redirected traffic to avoid Global Crossing.
Wednesday 11/30 evening
Global Crossing finally acknowledged that they had a network issue and began looking into it. By this time, our network traffic had been rerouted around them, so we were less concerned.
Based on a follow-up discussion, we have put in place a different problem-resolution procedure, which primarily involves stronger communication among all of the staff involved and clear, swift decision-making.