Lemmy.ca's Main Community @lemmy.ca

Lemmy.ca downtime Jan 5th/6th - Whew, we're finally back!

Hey everyone, and happy new year!

Sorry about that super long downtime there. Yesterday (Sunday) morning at 10:03AM PST our server suffered a physical hardware failure, apparently a power supply failure. Unfortunately, even though we opened a ticket with our hosting vendor (OVH) a few minutes later and they claim to have 24/7 support, nobody looked at our ticket until this morning, when their phone support lines opened and I called them.

They've now replaced a defective power supply and we're back online, after ~26 hours of being offline. Some pretty disappointing response times, to put it nicely.

We're planning to move away from OVH at the end of this month, onto proper enterprise-grade hardware that we own and control. This will give us a HUGE boost in server resources and allow us to scale for the foreseeable future, while also giving us the control to resolve problems like this much quicker. Expect another follow-up post about this in the next couple of weeks once I've put together the migration plan.

Timeline:

  • Jan 5th 10:03am PST - We get alerts to the server being non-responsive.
  • Jan 5th 10:05am PST - I pull up the console via IPMI and it's completely non-responsive. Attempting to power the server off / on, or do anything else, does not work.
  • Jan 5th 10:15am PST - Initial support ticket created with OVH. I followed up a couple times over the next few hours, and got no response.
  • Jan 6th 6:32am PST - Called OVH, gave them the case number and asked them to investigate
  • Jan 6th 7:34am PST - I get notified they'll start their "intervention" in 15 minutes.
  • Jan 6th 11:04am PST - Call them again, the tech is still working on it and they'll get back to me with an update
  • Jan 6th 11:34am PST - "I was informed by our data centre technician that there is an issue with the power supply unit for the rack on which your server resides. Your server will come back online once they have replaced the power supply."
  • Jan 6th 12:17pm PST - We're back up finally!
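
For anyone who wants to check the numbers, here's a minimal sketch that works out the durations from the timeline above. The timestamps are copied from the list; the year isn't stated in the post, so `YEAR = 2025` is an assumption (it only needs to be a valid year, since everything is in PST with no clock change in between). The helper names are just for illustration.

```python
from datetime import datetime

# Assumption: the post does not state the year; any year works for this arithmetic.
YEAR = 2025


def at(day, hour, minute):
    """Build a January timestamp (PST, 24-hour clock) for the given day."""
    return datetime(YEAR, 1, day, hour, minute)


def hours_minutes(delta):
    """Convert a timedelta into an (hours, minutes) tuple."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


# Timestamps taken from the timeline.
outage_start = at(5, 10, 3)    # alerts fire
ticket_opened = at(5, 10, 15)  # initial support ticket
phone_call = at(6, 6, 32)      # first call to OVH
back_online = at(6, 12, 17)    # server is back up

total_downtime = hours_minutes(back_online - outage_start)
ticket_unanswered = hours_minutes(phone_call - ticket_opened)
call_to_recovery = hours_minutes(back_online - phone_call)

print("Total downtime: %dh %dm" % total_downtime)               # 26h 14m
print("Ticket with no response: %dh %dm" % ticket_unanswered)   # 20h 17m
print("First call to recovery: %dh %dm" % call_to_recovery)     # 5h 45m
```

That lines up with the ~26 hours mentioned above, and shows that most of it was spent waiting on the ticket before the phone lines opened.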

Edit on Jan 7th @ 8:40am PST: We just had another outage of about an hour. Investigating with OVH.

59 comments
  • Wow, that’s pretty shitty service, considering that renting a physical server is likely not a small client thing. I’m also floored that a rack PDU failure was not detected and acted upon more proactively by their datacenter operations team, and necessitated a ticket to be opened by you. OVH really does seem to be the Temu of colo/hosting providers. Yikes!

    • They have a nice status page where you can see how many servers are down in a rack, I think that was a mis-comm between the tech + support since it only showed 1-2 down. No way a whole rack pdu failure should be down for that long.

  • Pretty piss-poor "24 hr. service" from OVH. We're an MSP and will make sure to steer clear of them for future projects.

    Glad to see everything back however!

    • Took them 24 hours to respond and 7 minutes to fix. Sounds like 24/7 service to me?

      • Shit. I didn't read your comment as a joke initially and almost had a mini-rant about how we know if a server 3 provinces away is down for 5 minutes and go into a red alert.

        That was dumb of me. Carry on.

  • Thank you for all your hard work and top notch comms, Lemmy.ca admin peeps. I spent the day at dbzer0, which is always fun, but happy to be back home!

  • [Opens lemmy the 20th time today] Ahhhhhhhhhhh, that's the stuff. I was entirely too productive!

  • Oh my. I do feel your pain. My Friendica instance was down at the same time as your Lemmy instance, for 2 days as well, except my issue was a corrupted database engine and corrupted database tables. Kindred spirits.

    I'm glad you got it resolved!

  • Well, thanks to this outage, I got a pretty good understanding of how often I turn on my Lemmy app. It turns out to be a whole heck of a lot. Haha.

    Glad things are back up and running and that you all have a path forward to mitigate this in the future.

  • Welcome back! OVH sucks, really happy to hear you're moving away from them!

  • Thank you! .... This has become my landing page on the internet so it's great to see you guys up again.

    Thanks for all the work you guys do!

  • Thank you! Slow response from OVH - you'd think they'd have monitoring for that.

  • Welcome back everyone! I was sad to see our instance down but I'm grateful we had some good follow up by our administration.
