General ECE service outage (September 14, 2017)

From ECE Information Technology Services

Latest revision as of 16:07, 14 September 2017

From 2017-09-14 11:10 to approximately 15:30, most ECE services were down, including home directories, SMB shares, IMAP, Webmail, and websites. Services should be working now; please report any persisting abnormalities. Queued inbound mail should have been delivered by now.

Linux workstations using NFS-mounted home directories may need to be rebooted.

Technical explanation

Most ECE servers rely on our NetApp filer in some way, whether for content, configuration, or for input/output.

The NetApp relies on LDAP netgroups to obtain the list of clients that are allowed to mount each NFS share.
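As a rough illustration of why an empty netgroup locks out every client, here is a minimal Python sketch of a host check against netgroup triples in the conventional `(host,user,domain)` form. The function names and the example host are hypothetical, nested netgroups are ignored, and this is not how the NetApp itself evaluates export rules.

```python
def parse_triple(triple: str) -> tuple[str, str, str]:
    """Split a '(host,user,domain)' netgroup triple into its three fields."""
    text = triple.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError(f"not a netgroup triple: {triple!r}")
    fields = [field.strip() for field in text[1:-1].split(",")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields in {triple!r}")
    return fields[0], fields[1], fields[2]


def host_allowed(host: str, triples: list[str]) -> bool:
    """Return True if any triple names this host (an empty host field matches any host)."""
    for triple in triples:
        triple_host, _user, _domain = parse_triple(triple)
        if triple_host == "" or triple_host == host:
            return True
    # No triple matched. An empty netgroup always ends up here,
    # so every client is denied.
    return False
```

With a populated list such as `["(ws1.example.edu,,)"]` the check passes for `ws1.example.edu`; with an empty list it fails for every host.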

On 2017-09-13 at 22:51, we made a routine edit to add a member to a netgroup. The automated process that syncs such changes to our LDAP servers failed, and instead of applying the edit it cleared all netgroup entries.
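One generic safeguard against this kind of failure is a sanity check that refuses to push a change set that would delete a large share of existing entries. The sketch below is hypothetical: the function name, the data layout (netgroup name mapped to a list of triples), and the 20% threshold are all assumptions for illustration, not a description of our sync process.

```python
def safe_to_sync(
    current: dict[str, list[str]],
    proposed: dict[str, list[str]],
    max_loss_fraction: float = 0.2,  # assumed threshold, tune to taste
) -> bool:
    """Return False if the proposed state drops too many netgroup entries."""
    before = sum(len(members) for members in current.values())
    after = sum(len(members) for members in proposed.values())
    if before == 0:
        # Nothing to lose, so any proposal is acceptable.
        return True
    lost = max(before - after, 0)
    return lost / before <= max_loss_fraction
```

A routine edit that adds one member passes the check, while a proposal that empties every netgroup is rejected and can be held for manual review.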

Because the NetApp aggressively caches netgroup entries, the failure went unnoticed until 2017-09-14 11:10, when the NetApp started denying access to all NFS clients. Although we restored the missing netgroup definitions in our LDAP server, the NetApp continued to deny access to clients until the cached bad netgroup entries could be flushed, which happened around 15:30. After that, we had to reboot various servers to restore functionality.
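The two delays described above (a late onset, then a late recovery) are what one would expect from any time-based cache in front of a directory. The following is a minimal sketch of a generic TTL cache, assuming a hypothetical `fetch` callable and an injectable clock; the NetApp's actual caching policy is not documented here and may differ.

```python
class CachedLookup:
    """Generic TTL cache in front of a slower lookup (illustrative only)."""

    def __init__(self, fetch, ttl_seconds, clock):
        self._fetch = fetch        # callable: name -> list of members
        self._ttl = ttl_seconds    # how long a cached answer is trusted
        self._clock = clock        # callable returning the current time
        self._entries = {}         # name -> (expiry time, cached members)

    def get(self, name):
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and now < hit[0]:
            # Still fresh: answer from cache, even if the source has changed.
            return hit[1]
        value = self._fetch(name)
        self._entries[name] = (now + self._ttl, value)
        return value

    def flush(self):
        """Drop every cached answer so the next lookup goes to the source."""
        self._entries.clear()
```

In this model, a good answer cached before the source is emptied keeps being served until it expires, which hides the breakage. Once the empty answer is cached, it keeps being served after the source is repaired, until it expires or `flush()` is called.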