Resolved
lucy.mxrouting.net

Started
December 16, 2023 at 11:27 AM
Status
Resolved after 5 days

Impact

Partial outage
Affected
DirectAdmin Panel
IMAP
SMTP
Webmail
  • Resolved
    Resolved

    The Lucy server is considered to have been restored. Here are some key points:

    1. While we're working to recover what we can of 1 week of email that was lost from restoring this backup, we are considering the data gone. If asked for it, as of right now, our answer would be that the data is lost. If we have any success in this, it will not be quick.

    2. We're not saying everything is working flawlessly, but we're not going to update the status page for individual spot fixes. The larger picture here is that the restore is done, and that everything else which needs to be done is per user and not something that needs to be fixed for the whole server collectively.

    3. Any further issues with the server should be in a support ticket. Going through them may not be a quick process. Requests to restore the last week of data will be referred back to #1 above.

  • Identified
    Update

    Things on my plate for the Lucy server today:

    • Ensure the last of the restores finish
    • Double check consistency of restores. At least 1 restore we know of finished with a bunch of missing email accounts. It may have been a one-time problem; I won't know until I dig further.
    • See about getting back email forwarders that JB didn't restore
    • Check if sieve filters restored properly (why shouldn't they have?)
  • Identified
    Update

    We're on the tail end of backup restores. We've restored 2563 of 2862 accounts. Here are some quick bullet points that might save you time and questions:

    1. At least 1 account was identified to have not properly restored. All of its email accounts were missing. How many are like that? I hope 1. But answering that question and making it right is one of the line items in front of me.

    2. Customers are still reporting varying results with custom hostname SSL certs (mail/webmail/mailadmin subdomains). If you can fix this in DirectAdmin from your side, please do. Our attempts at fixing these while doing restores have created temporary problems that overshadow this, so we need to avoid touching that from root until later. If our IP is rate limited by Let's Encrypt, I don't think we can fix that right now (their form says not eligible for increase).

    3. JB (software) doesn't appear to be restoring a single email forwarder. Not revisiting this until after restores finish.

    4. I (Jarland) am experiencing the worst kind of empathy. My inability to give immediate satisfaction to anyone who is currently begging me to fix something is upsetting me. It's important that I interact with customers sparingly right now; I notice that overdoing it is impacting my performance on restore/repair.

    5. Most of our users on the Lucy server are online and in wonderful condition. This isn't for your benefit as much as mine, sometimes I need to think about what is working as opposed to what isn't.

  • Identified
    Update

    We are continuing to finish the restores, but most customers are online. A couple of notes you might have missed from previous updates:

    1. This is the backup we restored to the previous Lucy server, the one that failed. Due to extremely heavy usage from remaining repair efforts and resellers kicking off backups, we did not get a chance to back up that server before it failed. Expect a week of email to be missing. We're trying to get it back, but we want you to expect it to be gone. Hope is not warranted in that effort.

    2. A lot of users are reporting custom SSL hostname issues. If you can fix it yourself in DirectAdmin that's great, but we're going to stop trying to fix them. We can't keep forcing additional Apache config reloads while the restores are doing the same; it terrorizes all of the users that have services online right now.

  • Identified
    Update

    Clear skies for most users on the lucy.mxrouting.net server. The remaining restores are still ongoing, and most issues preventing users from doing what they needed to do have been fixed.

  • Identified
    Update

    Restores are going quite well and they will definitely be complete today. Today we'll also fix the association between users and resellers, for users that do not appear connected to their resellers.

  • Identified
    Update

    Backup restores are going great and picking up speed. Most custom webmail subdomain (non-Crossbox) SSL certs are restored for the accounts that have been restored.

  • Identified
    Update

    Backup restores on lucy.mxrouting.net slowed down overnight due to an API failure during the restore of reseller accounts. The failure prevented an IP from being assigned to the reseller, which in turn caused that reseller's user account restores to fail whenever they were attempted after the reseller's own restore. We've corrected this and we're plowing through the restore lists again. Restore speeds should be faster now because, even for the ones that failed, more than half of the job (the compression of their data into a backup archive) had already finished.
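
    As a rough illustration of the dependency described above, here is a minimal sketch that splits user restores into those safe to attempt and those to hold back until their reseller has restored cleanly. The function name, the mapping, and the usernames are hypothetical; this is not how JetBackup or DirectAdmin schedule restores.

```python
def plan_user_restores(owner_of, restored_resellers):
    """Split users into (ready, deferred) restore lists.

    owner_of: hypothetical mapping of DA username -> owning reseller username.
    restored_resellers: set of resellers whose own restore finished cleanly.
    """
    ready, deferred = [], []
    for user, reseller in sorted(owner_of.items()):
        # A user is only safe to restore once its reseller exists on the server.
        if reseller in restored_resellers:
            ready.append(user)
        else:
            deferred.append(user)
    return ready, deferred


if __name__ == "__main__":
    owners = {"alice": "reseller1", "bob": "reseller2", "carol": "reseller1"}
    print(plan_user_restores(owners, {"reseller1"}))
```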

  • Identified
    Update

    Inbound email has been re-enabled on Lucy.

  • Identified
    Update

    Here's the current state of the lucy.mxrouting.net server:

    1. We've restored 1,000 accounts as of this update. Most likely, all restores will be complete by the end of Monday (US/Central). Users can start using their accounts as they are restored. You can check if yours has been by searching your DA username here: https://gw.mxroute.com/lucy.php

    2. Inbound email will open back up at 11:00PM tonight, US/Central time.

    3. Your reseller DA account having been restored does not mean that your users have all been restored. We have backups broken up into 12 batches, each batch is restoring independently.

    4. Because some users may be restored prior to their reseller being restored, it's possible that some of your sub-users may not appear in your list in DA even if they have been restored. We'll fix that after restores; doing it while we're restoring removes our ability to take fully informed bulk actions and risks mistakes. It shouldn't stop your users from using their service, if their DA account has been restored.

    5. If your custom webmail/mail/mailadmin URLs do not have working SSL, you may need to reapply the steps for it. We corrected what caused this after restoring about 800 backups, and it's another situation where we don't want to take bulk action to fix it while restores are still happening, as it makes results unpredictable.

    6. The backup being restored is the last JetBackup copy of Lucy prior to the previous outage. The last week of email will not be included. We do hope to recover that from the previous Lucy server, but would prefer that you consider the data lost. Let it be a happy surprise if we recover it, but we don't think we will. It sucks, bad.

    7. The "new/old backup plan" we talked about after restoring Lucy was going to start its first round several hours after this outage. With so many users running their own backups, and with us still fixing some things, we didn't want to bring the server down by hammering the disks even harder when we were already at 25%+ iowait. Right as we were settling to a new normal and would have been able to start the backups, we got slapped down.

    8. The good news is, if we have any worse luck none of us will care about this because it probably means nuclear war.

  • Identified
    Update

    State of lucy.mxrouting.net server:

    1. Still restoring backups.

    2. Not accepting new inbound mail, but will be in a few hours. This is to ensure that email out there waiting to retry delivery to you has the best chance at being received properly.

    3. You can see if your DirectAdmin user has been restored by inputting its username here: https://gw.mxroute.com/lucy.php

    4. A reseller user having been restored doesn't mean their users were all restored yet. Users restored prior to the restore of their reseller may not be correctly linked, we'll fix that if it ends up being the case.

    5. Working on custom webmail subdomain SSL.

  • Identified
    Update

    Backup restores are going strong. The speed of the restores cannot be calculated, so don't take this report as an invitation to do the math; you will be wrong on it. That said, we're at 913 backups restored right now on the Lucy server.

    Inbound email will be re-enabled tonight. Remember that you can check your DirectAdmin username here to see if you've been restored: https://gw.mxroute.com/lucy.php

    If a reseller account has been restored, that doesn't mean that their sub-users have been restored. It's also possible that we need to re-parse the list of owned users when this is done to ensure that resellers see all of their users listed in DirectAdmin when they go looking for them, but that's not a problem to be addressed during this stage of the repair.
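
    A minimal sketch of what "re-parsing the list of owned users" could look like once restores are done, assuming each restored user's backup records which reseller owns it. The data structures and names here are hypothetical and do not reflect DirectAdmin's actual files or API.

```python
from collections import defaultdict


def rebuild_owned_users(owner_of, restored_users):
    """Rebuild reseller -> [users] from per-user ownership records.

    owner_of: hypothetical mapping of DA username -> owning reseller.
    restored_users: iterable of usernames that have finished restoring.
    """
    owned = defaultdict(list)
    for user in sorted(restored_users):
        reseller = owner_of.get(user)
        if reseller is not None:  # skip users with no recorded owner
            owned[reseller].append(user)
    return dict(owned)


if __name__ == "__main__":
    owners = {"alice": "reseller1", "bob": "reseller2", "carol": "reseller1"}
    print(rebuild_owned_users(owners, ["carol", "alice"]))
```

    Doing this once, after every batch has finished, matches the reasoning above: the full picture is only available when no restore is still changing it.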

  • Identified
    Update

    Crossbox is reinstalled on the Lucy server. Of the 2862 backups to restore, 797 have been restored. Still going. These backup restores are more complete than the restore on Lucy2. Also, we're referring to the servers now as these:

    Lucy1 - Failed RAID controller, recovered OS, survived until chassis swap and then file system was hosed.

    Lucy2 - Failed RAID, reason not 100% proven but suspected as 1 bad sync + 1 bad disk

    Lucy3 - The one in production right now seeing backups restored to it, not accepting inbound email until backups finish restoring (to preserve inbound email sitting in retry queues)

    If we require a Lucy4, we're retiring the name Lucy and apologizing to whatever god we angered.

  • Identified
    Update

    The lucy.mxrouting.net server has a new IP for the moment, which is 94.130.135.140. If your DNS is using CNAME records to point to lucy.mxrouting.net, you DO NOT need to make ANY changes to your DNS. This is only for users who are well aware that they created A records for this, which is fine but not something we suggest or directly approve of.
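
    To make the CNAME versus A record distinction concrete, here is a minimal sketch of the decision. The function name is hypothetical, and it assumes you already know which record type your domain uses to point at Lucy.

```python
NEW_LUCY_IP = "94.130.135.140"  # the new IP stated in this update


def dns_change_needed(record_type, value):
    """Return True if a record pointing at Lucy must be edited by hand.

    record_type: "CNAME" or "A" (case-insensitive).
    value: the record's current target (hostname for CNAME, IP for A).
    """
    record_type = record_type.upper()
    if record_type == "CNAME":
        # A CNAME follows lucy.mxrouting.net wherever it points.
        return False
    if record_type == "A":
        # A hard-coded IP only needs updating if it is not the new one.
        return value != NEW_LUCY_IP
    raise ValueError(f"unsupported record type: {record_type}")


if __name__ == "__main__":
    print(dns_change_needed("CNAME", "lucy.mxrouting.net"))
    print(dns_change_needed("A", "192.0.2.10"))
```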

    While you may see lucy.mxrouting.net online, we absolutely cannot accept new email into this server until more accounts have been restored. Doing so would mean the loss of all inbound email from the last 24 hours for users that have not yet been restored. You can check if your account has been restored by typing your DirectAdmin username into this form: https://gw.mxroute.com/lucy.php (Note that your reseller username being restored is not an indication that your sub-users have been restored; each DA username is considered its own for this check).

    As soon as we've restored the accounts, we'll start accepting inbound email.

  • Identified
    Update

    We're increasing the number of simultaneous backup restores on the new Lucy server. You can check if your account has been restored yet by inputting your DirectAdmin username here: https://gw.mxroute.com/lucy.php

    That check will come in handy a bit later; for now it's not very useful.

  • Identified
    Update

    Backups are being restored on the server which was initially part of our Plan B from the last outage. This will mean a change of IP for the lucy.mxrouting.net server. An email will go out about that when it's time.

  • Identified
    Update

    While it cannot be conclusively proven, we believe that this is what happened to the RAID10 array on the Lucy server:

    1. One of the drives took too long to sync, and was kicked out of the RAID.
    2. Another drive failed, leaving us with 2 drives in a RAID10 array.
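
    A minimal sketch of why that second failure mattered, assuming a 4-drive RAID10 laid out as two mirrored pairs. The drive names and pairing below are hypothetical, not read from the actual controller. The array only survives two failed drives when they sit in different mirror pairs.

```python
from itertools import combinations

# Hypothetical layout: 4 drives as two mirrored pairs, striped together.
MIRROR_PAIRS = [("d0", "d1"), ("d2", "d3")]


def array_survives(failed):
    """RAID10 survives while every mirror pair keeps at least one healthy drive."""
    failed = set(failed)
    return all(any(d not in failed for d in pair) for pair in MIRROR_PAIRS)


def two_drive_failure_outcomes():
    """Map every possible pair of failed drives to whether the array survives."""
    drives = [d for pair in MIRROR_PAIRS for d in pair]
    return {combo: array_survives(combo) for combo in combinations(drives, 2)}


if __name__ == "__main__":
    for combo, ok in two_drive_failure_outcomes().items():
        print(combo, "survives" if ok else "array lost")
```

    Under that assumption, 4 of the 6 possible two-drive failures are survivable and 2 are not.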

    We had not started monitoring the RAID or even started our new backup strategy with the server yet. We were still spot fixing issues from the previous restore, from customer tickets. While we are still working to see if we can recover any data from this server, we are moving forward with restoring a new server to the state that Lucy was previously restored to, after the most recent outage. That means that the server will be missing data which was written to the server after that last restore. Although we'd like to get that data back, we can't wait on that, so we'll have to try to do that afterward.

    Restoration is now in progress.

  • Identified
    Update

    Disk cloning, as an attempt to make the RAID controller appreciate its drives like it's supposed to, is the current path. There may be no reason to update here until that reaches another stage. These are 8TB disks we're cloning, just 2 of 4 right now.

  • Identified
    Update

    We're cloning disks and replacing them as a troubleshooting step.

  • Identified
    Update

    We're still watching the RAID array rebuild on the 4th disk.

  • Identified
    Update

    Rebuilding RAID array before going any further. Won’t be any update for a bit while we let that run.

  • Identified
    Update

    We are continuing to work on a fix for this incident.

  • Identified
    Update

    We're still working on this.

  • Identified
    Update

    Our hardware expert is working on the server, and we hope to have it back up in just a bit.

  • Identified
    Update

    Reseating the drives wasn't enough to help, so looking into cables attached to the RAID controller. This may take a bit more time. It's surely nothing deeper than that, but we can't just send any old remote hands in to do that task without having the hardware master available. We're waiting on confirmation from our hardware master that he's taking control of that effort.

  • Identified
    Identified

    The issue with the Lucy server looks as though some of its disks were removed from the server. This is a hardware RAID10 configuration, so there shouldn't be any problem here which is fatal. That a completely different server which shared the same name as this one experienced a storage related failure recently is a coincidence worth writing a book about, but there can be absolutely no way that the events are connected.

    We're working on it, but we want to consult someone more senior on issues of this nature and be careful how we proceed, especially given what users on this server have recently been through.

  • Investigating
    Investigating

    We are currently investigating this incident. The server was found to be online but unresponsive on several key services. Permission errors were visible on the IPMI console, and a reboot led us to an interesting puzzle. We're working on this. There should be no relation to any previous events on this server.