MXroute - Lucy server – Incident details


Lucy server

Resolved
Operational
Started about 1 year ago
Lasted about 11 hours

Affected

DirectAdmin Panel

Operational from 6:57 PM to 5:50 AM

IMAP

Operational from 6:57 PM to 5:50 AM

SMTP

Operational from 6:57 PM to 5:50 AM

Webmail

Operational from 6:57 PM to 5:50 AM

Updates
  • Resolved

    Lucy is back online. A full account of events should follow a bit later.

  • Update

    There's an admin dream team working on this box right now with us.

  • Update

    We are continuing to work on a fix for this incident.

  • Update

    With activity split between getting the production server to boot its OS and deploying a replacement server, JetBackup suddenly doesn't work on the new server. So we opened a ticket with JetBackup. Whatever supernatural entity is responsible for this incredible sequence of events, even you have to sleep eventually.

  • Update

    We're still working on the OS. During this process, we've been preparing a replacement server to restore backups to, as we'd like that process to be moving along just in case we need it. If it's time wasted, and we sure hope it is, it was a worthy cause.

  • Update

    By all accounts we should either be online or right on the edge of being online. But the OS is still giving us hell. You can vent in tickets if you want, and we get it, but you won't speed this up. If we had a magic button we'd have pressed it already.

  • Update

    Hardware replaced. Now it's just a matter of getting the OS to deal with the hardware changes and boot properly.

  • Update

    Still working on this. Note there is a reason it says 'Operational' above and it's too long to explain here, but no service on the lucy.mxrouting.net server is online until the text of these updates says otherwise.

  • Update

    RAID controller replacement didn't help. Was it replaced wrong? Is it the riser card? Stay tuned to find out.

  • Update

    We are continuing to work with remote hands on the RAID controller replacement.

  • Update

    Frustration is understood. We're waiting to hear back from remote hands at the datacenter. Here are some facts:

    1. No inbound email will be missed; it will flow in after the server is back online.
    2. Many reasonable redundancies were in place here. It is not common for RAID controllers to suddenly fail.
    3. This server is actively backed up, no matter what else happens.

    We expect to pop in the new controller, import the RAID array, and be back to working fine, on the expectation that the rest of this hardware configuration is sound. Since this looks like nothing more than a statistical anomaly, we don't expect a repeat event.

  • Update

    The server is not detecting the RAID controller. We're going to swap out the controller with the spare.

  • Identified

    The Lucy server is having trouble booting, so that's interesting. We're working on it.

  • Investigating

    We are currently investigating this incident.