Issues with Taylor server
Partial Outage · Reported Mar 4, 2026 at 8:31 PM UTC · Resolved Mar 4, 2026 at 8:38 PM UTC · Duration: 6m
Affected Systems
Postmortem
Let's get the facts out first, then we'll deal with the feelings.

On March 4th at 18:20 UTC the taylor.mxrouting.net server indicated that its disk was full (it wasn't) and halted all write operations. That was resolved at 20:34, at which point senders began seeing "Relay access denied" when sending mail to customers on the server; that second issue was resolved at 21:48.

The first issue appears to be that some software on the server held open an enormous amount of deleted data, causing the server to report roughly 400GB of storage as consumed when it wasn't. That space wasn't freed until the offending service was restarted. However, because these servers are so busy, lsof takes a very long time to complete on a production system like this, and we couldn't justify prolonging the outage while it ran. Instead we restarted all services on the box in a quick loop, so it's unclear which application was responsible.

The second issue is that DirectAdmin wrote blank files to /etc/virtual/domains and /etc/virtual/domainowners, key files used to determine local recipient domains. With those files empty, no one on the server can receive any email and several other functions break. Given the first issue, this isn't implausible, and likely isn't a bug.

The primary reason this became an issue for customers rather than for us is that disk alerting hadn't been properly configured for this server as intended (a copy/pasted monitoring profile that failed long ago). Had that alerting been active, we would have had time to diagnose the issue before it became a customer-facing problem, and that would have been the end of it. As it stands, the focus on restoring service as quickly as possible destroyed the evidence required to determine the true root cause that kicked off the event.

So we have two to-do items for this server:

1. Monitor for a similar event, catch it early enough to find the true root cause before it becomes customer facing again, and resolve it quickly if it does. Hopefully then we can fix the true root cause.
2. Work toward migrating this server to our new hardware builds, where we can say "good fucking luck" to any application trying to fill the remainder of the storage space on the server with already-deleted files.

I'm obviously not happy that this happened. I'm trying to detach myself from how I feel about it and report the facts exactly as I know them. But if you want to see your server admin angry about downtime, oh believe me, that feeling is extremely present.
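For readers unfamiliar with the failure mode described above: on Linux, deleting a file only removes its directory entry. The disk blocks stay allocated until every process holding the file open closes its descriptor, which is exactly how hundreds of gigabytes can be "consumed" by files that no longer appear anywhere on disk. A minimal Python sketch of the mechanism (the file and size here are illustrative, not taken from the incident):

```python
import os
import tempfile

# Create a file, write to it, then delete it WHILE keeping it open.
# This mimics a daemon holding deleted log/spool files.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 1024 * 1024)  # 1 MiB of data
os.unlink(path)                   # file vanishes from the namespace...

print(os.path.exists(path))       # False -- no visible file
print(os.fstat(fd).st_size)       # 1048576 -- but the data still occupies disk

os.close(fd)                      # only now can the filesystem reclaim the space
```

This is also why df and du disagree in this scenario: du walks the visible namespace and misses the deleted-but-open data, while df reports the filesystem's actual allocation. Tools like `lsof +L1`, or scanning /proc/*/fd for symlinks ending in "(deleted)", identify the processes responsible; that was the slow step skipped here in favor of blanket service restarts.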
Updates
The OS was reserving an unusually large amount of disk space and reporting 100% utilization while actual usage was nowhere near that. Fixed for now, but we're going to need to revisit this soon to make sure it doesn't happen again.
Investigating server.