We received notice this morning that webmail on London had been down for some time. Because we only monitored SMTP and IMAP, no alert was in place to notify us. This had happened once before, and we had high confidence that the fix applied then (moving Apache from the prefork MPM to the event MPM with PHP-FPM) had resolved it.

Our logs show the last hit to webmail at 10:13:21 and the next entry, after the reboot, at 19:44:58. By that measure, webmail was down for roughly nine and a half hours.
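
For anyone who wants to check the arithmetic, here is a minimal sketch of the gap calculation. It assumes both timestamps fall on the same day and in the same timezone, which the log excerpts above do not state explicitly; the function name `log_gap` is ours for illustration.

```python
from datetime import datetime, timedelta

# Timestamps quoted from the logs above (same day and timezone assumed).
LAST_HIT = "10:13:21"
FIRST_ENTRY_AFTER_REBOOT = "19:44:58"


def log_gap(start: str, end: str) -> timedelta:
    """Return the time elapsed between two HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


gap = log_gap(LAST_HIT, FIRST_ENTRY_AFTER_REBOOT)
print(gap)  # 9:31:37
```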

The cause is still not entirely clear: the logs simply show a gap, so we have to work from a lack of information. However, the behavior resembles an overloaded Apache (a state that, in our experience, is generally limited to prefork). We’ve therefore doubled each of the Apache MPM tuning values, and we’ve added a monitor on London’s webmail service to alert us to any future issue.
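
As an illustration of the kind of check the new monitor performs, here is a rough sketch in Python. It is not our actual monitoring configuration: the URL is a placeholder, and the names `http_status` and `webmail_is_up` are hypothetical. The idea is simply to request the webmail page over HTTP and treat anything other than a successful response as an outage, which an SMTP/IMAP-only check would never notice.

```python
from typing import Callable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Placeholder URL (assumption); the real webmail address is not given here.
WEBMAIL_URL = "https://webmail.example.com/"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrives."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except (URLError, OSError):
        # Connection refused, DNS failure, or timeout: no answer at all.
        return 0


def webmail_is_up(url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """Treat 2xx and 3xx responses as healthy and everything else as down."""
    return 200 <= fetch(url) < 400
```

A scheduler (cron or an existing monitoring system) would call `webmail_is_up(WEBMAIL_URL)` every minute or so and raise an alert on `False`. The `fetch` parameter exists only so the check can be exercised without a network connection.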

Jan 29th, 2020 13:21 CST

Online but investigating cause.

Jan 29th, 2020 12:55 CST

This morning London webmail went down. We were unable to resolve it without rebooting, which took longer than expected. A post-mortem will be posted here once the problem has been understood and mitigated. The service is back online at this time; this status is being held open only so that we can post more information later.

Jan 29th, 2020 12:49 CST