All systems operational

Billing Panel

100.0% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime
Dec 2023: 100.0% uptime
DirectAdmin Panel
98.49% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime
Dec 2023: 95.58% uptime

IMAP

98.49% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime
Dec 2023: 95.58% uptime

SMTP

98.47% uptime
Oct 2023: 99.93% uptime
Nov 2023: 100.0% uptime
Dec 2023: 95.58% uptime
Webmail
98.49% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime
Dec 2023: 95.58% uptime

Support

100.0% uptime
Oct 2023: 100.0% uptime
Nov 2023: 100.0% uptime
Dec 2023: 100.0% uptime

Notice history

Dec 2023

lucy.mxrouting.net
  • Resolved
    Resolved

    The Lucy server is considered to have been restored. Here are some key points:

    1. While we're working to recover what we can of 1 week of email that was lost from restoring this backup, we are considering the data gone. If asked for it, as of right now, our answer would be that the data is lost. If we have any success in this, it will not be quick.

    2. We're not saying everything is working flawlessly, but we're not going to update the status page for individual spot fixes. The larger picture here is that the restore is done, and that everything else which needs to be done is per user and not something that needs to be fixed for the whole server collectively.

    3. Any further issues with the server should be in a support ticket. Going through them may not be a quick process. Requests to restore the last week of data will be referred back to #1 above.

  • Identified
    Update

    Things on my plate for the Lucy server today:

    • Ensure the last of the restores finish
    • Double check consistency of restores. At least 1 restore we know of finished with a bunch of missing email accounts. It may have been a one-time problem, I won't know until I dig further.
    • See about getting back email forwarders that JB didn't restore
    • Check if sieve filters restored properly (why shouldn't they have?)
  • Identified
    Update

    We're on the tail end of backup restores. We've restored 2563 of 2862 accounts. Here are some quick bullet points that might save you time and questions:

    1. At least 1 account was identified to have not properly restored. All of its email accounts were missing. How many are like that? I hope 1. But answering that question and making it right is one of the line items in front of me.

    2. Customers are still reporting varying results with custom hostname SSL certs (mail/webmail/mailadmin subdomains). If you can fix this in DirectAdmin from your side, please do. Our attempts at fixing these while doing restores have resulted in temporary problems that overshadow this, so we need to avoid further touching that from root until later. If our IP is rate limited by Let's Encrypt, I don't think we can fix that right now (their form says not eligible for increase).

    3. JB (software) doesn't appear to be restoring a single email forwarder. Not revisiting this until after restores finish.

    4. I (Jarland) am experiencing the worst kind of empathy. My inability to give immediate satisfaction to anyone who is currently begging me to fix something is upsetting me. It's important that I interact with customers sparingly right now, I notice that overdoing it is impacting my performance on restore/repair.

    5. Most of our users on the Lucy server are online and in wonderful condition. This isn't for your benefit as much as mine, sometimes I need to think about what is working as opposed to what isn't.

  • Identified
    Update

    We are continuing to finish the restores, but most customers are online. A couple of notes you might have missed from previous updates:

    1. This is the backup we restored to the previous Lucy server, the one that failed. Due to extremely heavy usage from remaining repair efforts and resellers kicking off backups, we did not get a chance to back up that server before it failed. Expect a week of email to be missing. We're trying to get it back, but we want you to expect it to be gone. Hope is not warranted in that effort.

    2. A lot of users are reporting custom SSL hostname issues. If you can fix it yourself in DirectAdmin that's great, but we're going to stop trying to fix them because we can't keep forcing additional Apache config reloads while the restores are doing the same. It terrorizes all of the users that have services online right now.

  • Identified
    Update

    Clear skies for most users on the lucy.mxrouting.net server. The remaining restores are still ongoing, and most issues preventing users from doing what they needed to do have been fixed.

  • Identified
    Update

    Restores are going quite well and they will definitely be complete today. Today we'll fix the association between users and resellers, for users that do not appear connected to their resellers.

  • Identified
    Update

    Backup restores are going great and picking up speed. Most custom webmail subdomain (non-Crossbox) SSL certs are restored for the accounts that have been restored.

  • Identified
    Update

    Backup restores on lucy.mxrouting.net slowed down overnight due to an API failure in the restore of reseller accounts. The failure meant no IP was assigned to the reseller, which in turn caused the restore of their user accounts to fail whenever those restores were attempted after the reseller had been restored. We've corrected this and we're plowing through the restore lists again. Restore speeds should be faster now because, even for the ones that failed, more than half of the job (the compression of their data into a backup archive) had already finished.

  • Identified
    Update

    Inbound email has been re-enabled on Lucy.

  • Identified
    Update

    Here's the current state of the lucy.mxrouting.net server:

    1. We've restored 1,000 accounts as of this update. Most likely, all restores will be complete by the end of Monday (US/Central). Users can start using their accounts as they are restored. You can check if yours has been by searching your DA username here: https://gw.mxroute.com/lucy.php

    2. Inbound email will open back up at 11:00PM tonight, US/Central time.

    3. Your reseller DA account having been restored does not mean that your users have all been restored. We have backups broken up into 12 batches, each batch is restoring independently.

    4. Because some users may be restored prior to their reseller being restored, it's possible that some of your sub-users may not appear in your list in DA even if they have been restored. We'll fix that after restores, doing it while we're restoring removes our ability to take fully informed bulk actions and risks mistakes. It shouldn't stop your users from using their service, if their DA account has been restored.

    5. If your custom webmail/mail/mailadmin URLs do not have working SSL, you may need to reapply the steps for it. We corrected what caused this after restoring about 800 backups, and it's another situation where we don't want to take bulk action to fix it while restores are still happening, as it makes results unpredictable.

    6. The backup being restored is the last JetBackup copy of Lucy prior to the previous outage. The last week of email will not be included. We do hope to recover that from the previous Lucy server, but would prefer that you consider the data lost. Let it be a happy surprise if we recover it, but we don't think we will. It sucks, bad.

    7. The "new/old backup plan" we talked about after restoring Lucy was going to start its first round several hours after this outage. With so many users running their own backups, and with us still fixing some things, we didn't want to bring the server down by hammering the disks even harder when we were already at 25%+ iowait. Right as we were settling to a new normal and would have been able to start the backups, we got slapped down.

    8. The good news is, if we have any worse luck none of us will care about this because it probably means nuclear war.
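
To illustrate point 3 above (12 batches, each restoring independently), here is a minimal sketch of how a batch split can leave a reseller and its users in different batches. The round-robin assignment and the placeholder usernames are assumptions for illustration only; how the real batches were built is not described in these updates.

```python
def split_into_batches(usernames, batch_count=12):
    """Spread usernames across batch_count lists, round-robin (assumed scheme)."""
    batches = [[] for _ in range(batch_count)]
    for index, name in enumerate(usernames):
        batches[index % batch_count].append(name)
    return batches


def batch_of(username, batches):
    """Return the index of the batch holding username, or None if absent."""
    for index, batch in enumerate(batches):
        if username in batch:
            return index
    return None


# Placeholder usernames; 2862 is the account count quoted in these updates.
accounts = [f"user{n}" for n in range(2862)]
batches = split_into_batches(accounts)
```

Because each batch runs on its own schedule, a reseller sitting in one batch says nothing about when users in another batch will finish.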

  • Identified
    Update

    State of lucy.mxrouting.net server:

    1. Still restoring backups.

    2. Not accepting new inbound mail, but will be in a few hours. This is to ensure that email out there waiting to retry delivery to you has the best chance at being received properly.

    3. You can see if your DirectAdmin user has been restored by inputting its username here: https://gw.mxroute.com/lucy.php

    4. A reseller user having been restored doesn't mean their users were all restored yet. Users restored prior to the restore of their reseller may not be correctly linked, we'll fix that if it ends up being the case.

    5. Working on custom webmail subdomain SSL.

  • Identified
    Update

    Backup restores going strong. The speed of the backups cannot be calculated, don't take this report as an invitation to do the math, you will be wrong on it, but we're at 913 backups restored right now on the Lucy server.

    Inbound email will be re-enabled tonight. Remember that you can check your DirectAdmin username here to see if you've been restored: https://gw.mxroute.com/lucy.php

    If a reseller account has been restored, that doesn't mean that their sub-users have been restored. It's also possible that we need to re-parse the list of owned users when this is done to ensure that resellers see all of their users listed in DirectAdmin when they go looking for them, but that's not a problem to be addressed during this stage of the repair.

  • Identified
    Update

    Crossbox is reinstalled on the Lucy server. Of the 2862 backups to restore, 797 have been restored. Still going. These backup restores are more complete than the restore on Lucy2. Also, we're referring to the servers now as these:

    Lucy1 - Failed RAID controller, recovered OS, survived until chassis swap and then file system was hosed.

    Lucy2 - Failed RAID, reason not 100% proven but suspected as 1 bad sync + 1 bad disk

    Lucy3 - The one in production right now seeing backups restored to it, not accepting inbound email until backups finish restoring (to preserve inbound email sitting in retry queues)

    If we require a Lucy4, we're retiring the name Lucy and apologizing to whatever god we angered.

  • Identified
    Update

    The lucy.mxrouting.net server has a new IP for the moment, which is 94.130.135.140. If your DNS is using CNAME records to point to lucy.mxrouting.net, you DO NOT need to make ANY changes to your DNS. This is only for users who are well aware that they created A records for this, which is fine but not something we suggest or directly approve of.
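
A rough sketch of the rule described above, for anyone unsure which case applies to them. The function name and the example old address are hypothetical; only the hostname and the new IP come from this update.

```python
NEW_IP = "94.130.135.140"          # from this update, "for the moment"
SERVER_HOST = "lucy.mxrouting.net"


def needs_dns_change(record_type, record_value, new_ip=NEW_IP):
    """Return True only for an A record that does not already hold new_ip."""
    rtype = record_type.strip().upper()
    if rtype == "CNAME":
        # A CNAME follows the hostname, so the IP change is picked up automatically.
        return False
    if rtype == "A":
        return record_value.strip() != new_ip
    raise ValueError(f"unsupported record type: {record_type!r}")
```

For example, `needs_dns_change("CNAME", "lucy.mxrouting.net.")` is `False`, while an A record still holding some older address returns `True`.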

    While you may see lucy.mxrouting.net online, we absolutely cannot accept new email into this server until more accounts have been restored. Doing so would mean the loss of all inbound email from the last 24 hours for users that have not yet been restored. You can check if your account has been restored by typing your DirectAdmin username into this form: https://gw.mxroute.com/lucy.php (Note that your reseller username being restored is not an indication that your sub-users have been restored; each DA username is considered on its own for this check).

    As soon as we've restored the accounts, we'll start accepting inbound email.

  • Identified
    Update

    We're increasing the number of simultaneous backup restores on the new Lucy server. You can check if your account has been restored yet by inputting your DirectAdmin username here: https://gw.mxroute.com/lucy.php

    That check will come in handy a bit later, for now it's not very useful.

  • Identified
    Update

    Backups are being restored on the server which was initially part of our Plan B from the last outage. This will mean a change of IP for the lucy.mxrouting.net server. An email will go out about that when it's time.

  • Identified
    Update

    While it cannot be conclusively proven, we believe that this is what happened to the RAID10 array on the Lucy server:

    1. One of the drives took too long to sync, and was kicked out of the RAID.
    2. Another drive failed, leaving us with 2 drives in a RAID10 array.
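
Why losing two of four drives can be fatal for RAID10: the array is a stripe across mirrored pairs, so it survives only while every pair keeps at least one healthy drive. The sketch below assumes a layout of two pairs, (0, 1) and (2, 3); the actual pairing and which drives dropped out are not stated in this update.

```python
from itertools import combinations

# Assumed layout: four drives as two mirrored pairs, striped together.
MIRROR_PAIRS = [(0, 1), (2, 3)]


def raid10_survives(failed, mirror_pairs=MIRROR_PAIRS):
    """True if every mirror pair still has at least one healthy drive."""
    failed = set(failed)
    return all(any(d not in failed for d in pair) for pair in mirror_pairs)


def fatal_two_drive_losses(mirror_pairs=MIRROR_PAIRS):
    """List every two-drive failure combination that takes the array down."""
    drives = sorted(d for pair in mirror_pairs for d in pair)
    return [c for c in combinations(drives, 2) if not raid10_survives(c, mirror_pairs)]
```

Under that layout, two of the six possible two-drive losses are fatal: the ones that take out both halves of the same mirror.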

    We had not started monitoring the RAID or even started our new backup strategy with the server yet. We were still spot fixing issues from the previous restore, from customer tickets. While we are still working to see if we can recover any data from this server, we are moving forward with restoring a new server to the state that Lucy was previously restored to, after the most recent outage. That means that the server will be missing data which was written to the server after that last restore. Although we'd like to get that data back, we can't wait on that, we'll have to try to do that after.

    Restoration is now in progress.

  • Identified
    Update

    Disk cloning as an attempt to make the RAID controller appreciate its drives like it's supposed to is the current path. There may be no reason to update here until that reaches another stage. These are 8TB disks we're cloning, just 2 of 4 right now.

  • Identified
    Update

    We're cloning disks and replacing them as a troubleshooting step.

  • Identified
    Update

    We're still watching the RAID array rebuild on the 4th disk.

  • Identified
    Update

    Rebuilding RAID array before going any further. Won’t be any update for a bit while we let that run.

  • Identified
    Update

    We are continuing to work on a fix for this incident.

  • Identified
    Update

    We're still working on this.

  • Identified
    Update

    Our hardware expert is working on the server, hope to have it back up in just a bit.

  • Identified
    Update

    Reseating the drives wasn't enough to help, so looking into cables attached to the RAID controller. This may take a bit more time. It's surely nothing deeper than that, but we can't just send any old remote hands in to do that task without having the hardware master available. We're waiting on confirmation from our hardware master that he's taking control of that effort.

  • Identified
    Identified

    The issue with the Lucy server resembles that of some of its disks being removed from the server. This is a hardware RAID10 configuration, so there shouldn't be any problem here which is fatal. That a completely different server which shared the same name as this one experienced a storage related failure recently is a coincidence worth writing a book about, but there can be absolutely no way that the events are connected.

    We're working on it, but we want to consult someone more senior on issues of this nature and be careful how we proceed, especially given what users on this server have recently been through.

  • Investigating
    Investigating

    We are currently investigating this incident. The server was found to be online but unresponsive on several key services. Permission errors were visible on the IPMI console, and a reboot led us to an interesting puzzle. We're working on this. There should be no relation to any previous events on this server.

lucy.mxrouting.net
  • Completed
    December 10, 2023 at 10:12 PM

    Crossbox is back online on Lucy. Aside from copying over any remaining email data from the backup server, this is considered resolved. Any remaining issues should be handled via support ticket.

  • Update
    December 08, 2023 at 6:55 AM
    In progress

    Lucy is online. Any email users that can't log in should be able to in less than 1 hour. Please read the previous update for its numbered points, they are very relevant. We're beginning the sync of old emails. That part is going to take some time and we'll provide fewer updates for it on this status page.

  • Update
    December 08, 2023 at 6:09 AM
    In progress

    We're about to set users loose on the Lucy server. Here's what you need to know:

    1. We can't turn on Crossbox tonight, the MySQL versions are incompatible. Use webmail.mxroute.com, lucy.mxrouting.net/webmail, or lucy.mxrouting.net/snappy if you need to use webmail.

    2. Your old emails aren't there. They're NOT GONE! Don't worry. The goal was to get you into your accounts so you can start sending/receiving mail. It's going to take longer to get your old emails back in place. It might take a few days, honestly. That's why we didn't want to wait for that to bring it online.

    3. There might be unexpected oddities. Hopefully they'll be few.

  • Update
    December 08, 2023 at 3:09 AM
    In progress

    As of right now there are 3 Lucy servers to reference:

    1. The dead one (Original Lucy)
    2. The one we're working to restore backups on (New Lucy)
    3. The one we're working on migrating Original Lucy to in an experimental way (Experimental Lucy)

    Either #2 or #3 is going to be the production Lucy server. If Experimental Lucy wins, it'll be online tonight and everyone will simply be missing all of the email they've already previously received, sent, etc. However, that email will reappear slowly as we sync it over. We're rooting for Experimental Lucy, but we're prepared to fall back to New Lucy. Confusing? Surely.

    As of right now Original Lucy is running xfs_repair again so we can mount the FS and prep it for selective rsync to rebuild the server skeleton (accounts, email users, etc) on Experimental Lucy. Currently, the IP for lucy.mxrouting.net points to Experimental Lucy, where we're installing a fresh copy of DirectAdmin in preparation for the rebuild of the skeleton. We have a script running in a loop to ensure a consistent block on port 25 so that when exim is installed, it doesn't start rejecting the inbound email that is waiting for you, and you can actually receive it when we finish the job.
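
For the curious, a loop like the one mentioned above could look roughly like this. It is a sketch, not the script actually in use: the choice of iptables, the DROP target, and the interval are all assumptions. `iptables -C` checks whether a rule exists and `iptables -I` inserts it.

```python
import subprocess
import time

# Assumed rule: silently drop inbound TCP to port 25 so senders keep retrying.
RULE = ["INPUT", "-p", "tcp", "--dport", "25", "-j", "DROP"]


def run_quiet(args):
    """Run a command and return its exit code, discarding its output."""
    return subprocess.run(
        args, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def ensure_port25_blocked(run=run_quiet):
    """Insert the block rule if it is missing. Returns True if it was added."""
    if run(["iptables", "-C", *RULE]) == 0:
        return False  # rule already present, nothing to do
    run(["iptables", "-I", *RULE])
    return True


def keep_blocked(interval_seconds=5, iterations=None, run=run_quiet, sleep=time.sleep):
    """Re-check the rule forever, or for a fixed number of iterations."""
    count = 0
    while iterations is None or count < iterations:
        ensure_port25_blocked(run)
        sleep(interval_seconds)
        count += 1
```

The point of looping rather than adding the rule once is that an installer or firewall reload may flush rules partway through; re-checking keeps the window where port 25 is open as small as the interval.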

    If Experimental Lucy wins the game, it may run without Crossbox for several more hours, but that's an acceptable temporary loss for having everyone back online tonight. So that's where we're at.

  • Update
    December 08, 2023 at 1:48 AM
    In progress

    An update on the two efforts for restore:

    1. Still working to restore backups to a new server in Germany (as that's where our backups are, same datacenter, fastest connection).

    2. We mount original Lucy's file system in a recovery ISO (as that hasn't yet failed in tests, though it might later), build a new server next to it, and first we rsync enough data to get everyone's accounts back online but NOT their emails. When everyone has their accounts back online, we begin syncing emails. It's a bold strategy, fingers crossed.

  • Update
    December 08, 2023 at 1:31 AM
    In progress

    It's finally time to lay the original Lucy server to rest. It's not coming back. But we're still putting in two efforts to see which one can beat the other in a race:

    1. Restore backups to new server in Germany.
    2. Build a new server next to the original Lucy, mount its disk in a recovery ISO, and try to migrate its data to the new server.

    Whichever effort wins the race gets the prize of being the production server.

  • Update
    December 07, 2023 at 8:14 PM
    In progress

    No updates to give from the previous one.

  • Update
    December 07, 2023 at 3:50 PM
    In progress

    More split effort today: the backup restore plans continue, and the effort to revive the existing server continues as well. The transfer of backups to the new server is so much slower than imagined, simply because of the number of files (each email is 1 file). But we're still working on it.

  • Update
    December 07, 2023 at 6:28 AM
    In progress

    Still working on restoring backups. It's not that it isn't working, it's that it's a painfully slow process. We have to talk about that problem later though, the first priority is getting this online. While the backup effort is ongoing, we are still trying some radical things to bring the original server back online. It may be a long shot, but we're not leaving any angle of this on the table. Everything must be tested in order to bring the best and fastest resolution.

    Note that the schedule for this status update is not an ETA, at this point it's an arbitrary number chosen to keep the status open.

  • Update
    December 06, 2023 at 11:08 PM
    In progress

    Still working on restoring backups. Planning a full postmortem of this on blog.mxroute.com after the event.

  • Update
    December 06, 2023 at 8:01 PM
    In progress

    As we've continued working on the original Lucy server, we've had ups and downs. At one point we actually got it to ping, but the filesystem quickly went read-only and wouldn't mount again. We're still working on moving backups to a new server, independently of our attempts to repair the original server. An ETA would be irresponsible; there's simply no way to have any kind of estimate right now.

  • Update
    December 06, 2023 at 6:20 PM
    In progress

    Making progress. An ETA could only be isolated to "between 1 and 48 hours" so it's really not worth giving one yet.

  • Update
    December 06, 2023 at 3:09 PM
    In progress

    We are mostly considering the Lucy server to have failed and its file system to be hosed. We are working on restoring backups. This isn't a fast process.

  • Update
    December 06, 2023 at 8:17 AM
    In progress

    Two efforts have spawned as a result of this maintenance failure.

    1. Working to bring Lucy online
    2. Working to restore backups to a new server

    If I could tell you which one I was more confident in completing first, I would be happy to do so. I cannot.

  • Update
    December 06, 2023 at 7:26 AM
    In progress

    We're still working on this.

  • Update
    December 06, 2023 at 5:51 AM
    In progress

    Still working on this.

  • Update
    December 06, 2023 at 5:05 AM
    In progress

    The server did not boot into its OS as intended, despite booting fine on the last reboot. We're working on it. Everything is fine and no inbound email will be missed; a chassis swap just sometimes comes with unexpected hurdles.

  • In progress
    December 06, 2023 at 3:45 AM

    Maintenance is now in progress

  • Planned
    December 06, 2023 at 3:45 AM

    We need to move the disks from this server into a new chassis, and we were unable to schedule this in advance due to dynamic availability. It should be roughly the same work as the previous reboot, and back online in no time. No inbound email will be missed, the whole thing shouldn't feel any worse than temporarily losing your cell phone signal as you drive down the road.

    While we have a 3 hour window set for this, it's expected to take about 10-15 minutes.

Nov 2023

friday.mxlogin.com

Oct 2023
