DNS outage (Jun 2013)

On 2013-06-17 at approximately 17:55, one of the DNS servers at ICICS stopped responding. Because of the way the load balancers were configured, this caused a cascading failure that took down DNS service for ECE as well. The problem was resolved at 18:40, about 45 minutes later, by restarting the affected nameservers.

Many ECE services were affected during the outage:

  • Authentication attempts, and LDAP lookups in general, failed.
  • Mail delivery was delayed. Mail retrieval and submission were unavailable.
  • Any services relying on the ECE database server were unavailable. This included www.ece.ubc.ca, help.ece.ubc.ca, and ECE's RT ticket system.

Steps have been taken to decouple ECE's DNS service from the ICICS infrastructure so that the problem does not recur.
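
As an illustration only, the sketch below shows one way to probe each nameserver directly, bypassing any load balancer, so that a single unresponsive server is noticed before it affects dependent services. It is not the monitoring ECE actually uses. The function names, the `port` and `timeout` parameters, and the server addresses (taken from the 192.0.2.0/24 documentation range) are all assumptions made for the example. It handles IPv4 and UDP only.

```python
import socket
import struct


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def nameserver_responds(server_ip: str, hostname: str,
                        timeout: float = 2.0, port: int = 53) -> bool:
    """Return True if the server sends back any DNS response to our query."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, port))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as unreachable-host errors
            return False
    # Accept only a reply that echoes our query ID and has the response bit set
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


if __name__ == "__main__":
    # Hypothetical addresses; replace with the real nameservers to be checked
    for ip in ("192.0.2.1", "192.0.2.2"):
        status = "ok" if nameserver_responds(ip, "www.ece.ubc.ca") else "NOT RESPONDING"
        print(f"{ip}: {status}")
```

Checking every server individually matters here because a query sent through the load-balanced address can succeed while one backend is already down, which hides the fault until it cascades.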