I'm at college right now, which is a 3-hour drive away from my home, where a server of mine is. I just have to ask my parents to turn it back on when the power goes out or it gets borked. I access it solely through RustDesk and Cloudflare Tunnels SSH (it's actually pretty cool, they have a web interface for it).
I have no car, so there's really no way to access it in case something catastrophic happens. I have to rely on hopes, prayers, and the power of a probably outdated Pop!_OS install. Totally doesn't stress me out, I'll just say I like to live on the edge :^)
I like how POSTing got fairly fast. Then we started putting absurd amounts of RAM into servers, so now they're back to slow.
Like we have a high clock speed dual 32-core AMD server with 1TB of RAM that takes at least 5 minutes to do its RAM check. So every time you need to reboot you're just sitting there twiddling your thumbs, waiting anxiously.
If you have the bandwidth... it is absolutely worth it to invest in a maintenance mode for your system: just check some flat file on disk for a flag before loading up a router or anything and then, if it's engaged, send back a static HTML file with ye olde "under construction" picture.
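For anyone who wants to see the shape of it, here's a rough sketch of that flag-file check as a tiny WSGI wrapper in Python. The function name `with_maintenance_mode`, the flag path in the usage note, and the inline page are all made up for illustration; a real setup would more likely serve an actual static file from the web server itself.

```python
import os

# Placeholder page; stands in for the static "under construction" HTML file.
MAINTENANCE_PAGE = b"<html><body><h1>Under construction</h1></body></html>"


def with_maintenance_mode(app, flag_path, page=MAINTENANCE_PAGE):
    """Wrap a WSGI app so a flag file on disk short-circuits every request."""

    def wrapped(environ, start_response):
        # The flag is checked on every request, before any routing happens.
        if os.path.exists(flag_path):
            start_response(
                "503 Service Unavailable",
                [
                    ("Content-Type", "text/html; charset=utf-8"),
                    ("Content-Length", str(len(page))),
                ],
            )
            return [page]
        # Flag not set: hand the request to the real application.
        return app(environ, start_response)

    return wrapped
```

Engaging it would then just be `touch /some/path/maintenance.flag` (hypothetical path) and removing the file turns it back off, no redeploy needed.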
That’s not really… possible at this point. We have thousands of customers (some very large ones, like A——n and G—-e and Wal___t) with tens or hundreds of millions of users, and even at lowest traffic periods do 60k+ queries per second.
This is the same MySQL instance I wrote about a while ago that hit the 16TiB table size limit (due to ext4 file system limitations) and caused a massive outage; worst I’ve been involved in during my 26-year career.
Every day I am shocked at our scale, considering my company is only like 90 engineers.
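On the 16TiB limit mentioned above: with 4 KiB blocks (an assumption, the block size isn't stated here), ext4 addresses at most 2^32 blocks per file, which works out to 16 TiB. A minimal sketch of a check that would flag files creeping up on that ceiling could look like this; `files_near_limit` and the 80% threshold are illustrative choices, not anything from the original setup.

```python
import os

# 2**32 addressable blocks per file * 4096-byte blocks = 16 TiB
# (assumes the common 4 KiB ext4 block size).
EXT4_MAX_FILE_BYTES = 2**32 * 4096


def files_near_limit(root, limit=EXT4_MAX_FILE_BYTES, warn_fraction=0.8):
    """Return (path, fraction_of_limit) for files at or above warn_fraction."""
    hits = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= limit * warn_fraction:
                hits.append((path, size / limit))
    # Biggest offenders first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Pointing something like this at the database data directory from cron would at least turn the limit into a warning weeks ahead instead of an outage.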
There's a package called molly-guard which will check to see if you are connected via ssh when you try to shut it down. If you are, it will ask you for the hostname of the system to make sure you're shutting down the right one.
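The idea is simple enough to sketch. This is not molly-guard's actual code, just a Python illustration of the same check: look for the `SSH_CONNECTION` environment variable that sshd sets, and if it's there, require the typed hostname to match. The function name `confirm_shutdown` and its parameters are hypothetical.

```python
import os
import socket


def confirm_shutdown(typed_hostname, environ=None, hostname=None):
    """Return True if the shutdown should be allowed to proceed."""
    environ = os.environ if environ is None else environ
    hostname = socket.gethostname() if hostname is None else hostname

    # Local console session: nothing to guard against, let it through.
    if "SSH_CONNECTION" not in environ:
        return True

    # Over SSH: compare against the short hostname, case-insensitively.
    short_name = hostname.split(".")[0].lower()
    return typed_hostname.strip().lower() == short_name
```

A wrapper around `shutdown` would prompt for the name, call this, and bail out on a mismatch.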
Since you're using sudo, I suggest setting different passwords on production, remote, and personal systems. That way, you'll get a password error before a tired/distracted command executes in the wrong terminal.
Just having a multitude of terminals open with a mix of test environment and (just for comparison) an open connection to the production servers...
We were at a fair/exhibition once, and on the first day people working on an actual customer project asked us if they could compare with our code.
Obviously they flashed the wrong PLC and we were stuck dead in the first hours of the exhibition.
I still think that this place was cursed, as we also had to re-solder some connections on our robot multiple times, and the cherry on top was the system flash dying, which is where I had fucked up, because I had finished everything late at night and didn't make a complete backup of everything.
But it seems, if luck runs out, you lose on all fronts.
At least I was able to restore everything in 20 minutes, which must be some kind of record.
But I was shaking so much from the stress that I couldn't type efficiently anymore, and I was lucky to have a colleague to just calmly enter what I told him to. With that we were able to get the showcase up and running again.
Well, at least the beer afterwards tasted like the liquid of the gods.
I was making after hours config changes on a pair of mostly-but-not-entirely redundant Cisco L3 switches which basically controlled the entire network at that location. While updating the running configs I mixed up which ssh session was which switch and accidentally gave both switches the same IP address, and before I noticed the error I copied the running config to the startup config.
Due to other limitations and the fact that these changes were to fix DNS issues (and therefore I couldn't rely on DNS to save me), I ended up sshing in by IP over and over until I got the right switch, then trying to make the change before my session died due to dropped packets from the mucked-up network situation I had created. That easily added a couple of hours of cleanup to the maintenance I was doing.
Networking, we had a remote office in Europe (I'm in the US) and wanted to reset a phone. Phone was on port 10 of the Cisco switch, port 1 went to the firewall (not my design, already in place).
Helping my coworker, I tell her to shut port 10.
Shut port 1, enter.
Ok... office is offline and on another continent...
I work with IBM i/AS400 servers and those are not exactly the quickest thing to "reboot" (technically an IPL). Especially the old ones. I have access to the HMC/console but even this sometimes takes several minutes (if not dozens) just to show what's going on.
It's always a bit stressful to see the codes passing one after the other and then it stops on one and seems to get stuck there for a while before continuing the IPL process. Maybe it's applying PTFs (updates) or something, and you just have to wait while even the console is blank.
I've been monitoring those servers for years and I'm still sometimes wondering if it hung during the IPL or if it's just doing its thing, because this part, even with codes, is not very verbose.
Fortunately it's also very stable so it pretty much always comes back a few minutes after you start wondering why the hell it's taking so long.
When someone previously told a VRTX VM not to auto boot after power up and none of the remote access is working either... Both undocumented as well, of course. And your tired AF tech is statically configuring the wrong IP range on their laptop to manu because it's been a long shutdown day and they're also unfamiliar with the system in general (me). Good times. I figured it out though, but lots of sweating and swearing.
That's why you connect an Arduino to the motherboard's reset pin and load it with a program that resets the system if it doesn't receive an ACK signal over the USB connection every 10 minutes.
Eventually though the networking and Apache stop working after around 150 days, so you also have to make a script that resets the system after 30 minutes of not having network.
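Both of those are the same watchdog pattern: remember the last time things looked healthy and pull the trigger once that's too long ago. Here's a minimal Python sketch of the network version. The probe target (`1.1.1.1` port 53), the 60-second polling interval, and the `systemctl reboot` call are all assumptions for illustration, so swap in whatever fits your box.

```python
import socket
import subprocess
import time


class Watchdog:
    """Expires when it has not been fed for limit_seconds."""

    def __init__(self, limit_seconds, now):
        self.limit = limit_seconds
        self.last_ok = now

    def feed(self, now):
        self.last_ok = now

    def expired(self, now):
        return now - self.last_ok >= self.limit


def network_up(host="1.1.1.1", port=53, timeout=3.0):
    """Crude reachability probe: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reboot():
    # Assumes a systemd machine and enough privileges to reboot.
    subprocess.run(["systemctl", "reboot"], check=False)


def run(check=network_up, reset=reboot, limit=30 * 60, interval=60,
        clock=time.monotonic, sleep=time.sleep):
    """Poll check(); call reset() once it has failed for limit seconds."""
    dog = Watchdog(limit, clock())
    while True:
        if check():
            dog.feed(clock())
        elif dog.expired(clock()):
            reset()
            return
        sleep(interval)
```

Calling `run()` from a service that starts at boot gives the 30-minute behavior. The Arduino side would be the same logic with the ACK over USB acting as the "feed" and a 10-minute limit.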