Hello everyone. I have a system with a Ryzen 9 7950X, 32 GB of 6400 MHz DDR5 RAM, a 1 TB primary SSD with Windows 11 and Linux installed, a Gainward RTX 4080 graphics card, and an Asus Prime X670-P Wifi mobo. I also have a 1 TB SSD and 2 TB HDDs mounted. 2-3 months ago, I started getting crashes on both OSes, and over time they became more frequent. I bought a brand new SSD for the OS installation, but after a while the crashes started again. I cannot get any error message on Windows, since the BSOD stays up for only 2-3 seconds before the system restarts. After a restart, I sometimes get a "no Bootable device found" error at the boot stage. When the crash happens on Linux, the dmesg output looks as if the whole SSD disconnected, and it shows I/O messages for the root partition as well. I changed the primary SSD 1 month ago, and the errors still persist. I sent the mobo in for service, and no issues were found. The BIOS has also been updated and reset. When I run the PC from live Linux media, however, I get no issues. What else can I do? What could cause this issue? Thanks in advance.
I once had an El Cheapo, very questionable SATA SSD fail on my system. It had similar symptoms: Windows would hang and crash at random, more and more frequently over time. While digging through the Windows logs and troubleshooting, I found that the system would crash when trying to access the drive via File Explorer, because the drive would disconnect. The SSD seemed to fail slowly, but I was using it as a faster workspace and saving everything to an HDD, so I never looked into the possibility of a failing drive until the system wouldn't boot. Removing the drive cured everything. I should probably note that the failed SSD wasn't the boot drive. It was used strictly for data, so the OS itself wasn't being unmounted directly. I think the drive was shorting out some of the SATA pins and scrambling the whole bus.
Several years later, on the Linux side of things, I found out that fstab can prevent booting if a storage device is missing. On a fresh install, an external drive enclosure had been automatically configured in fstab as a critical component. I'm not sure what the error messages would look like if an internal data drive mounted as critical disconnected on a running system, but I would assume Linux would halt even if no processes were running from the drive.
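For the fstab part, here is a rough sketch of how you could spot entries that may block booting when the device is absent. It assumes the usual whitespace-separated fstab layout and relies on the standard `nofail` and `noauto` mount options. The function name `entries_without_nofail` and the sample UUIDs and mount points are made up for illustration; they are not from anyone's actual system.

```python
def entries_without_nofail(fstab_text):
    """Return (device, mount_point) pairs that have neither 'nofail' nor 'noauto'.

    Such entries are typically treated as required at boot, so a missing
    device can stall or stop the boot process.
    """
    risky = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # Not a complete entry; ignore it in this sketch.
        device, mount_point, fs_type, options = fields[:4]
        # The root filesystem and swap are left out of the report on purpose.
        if mount_point == "/" or fs_type == "swap":
            continue
        option_list = options.split(",")
        if "nofail" not in option_list and "noauto" not in option_list:
            risky.append((device, mount_point))
    return risky


# Hypothetical fstab content, for illustration only.
SAMPLE_FSTAB = """\
# <device>      <mount point>  <type>  <options>        <dump>  <pass>
UUID=aaaa-1111  /              ext4    defaults         0       1
UUID=bbbb-2222  /mnt/data      ext4    defaults         0       2
UUID=cccc-3333  /mnt/backup    ext4    defaults,nofail  0       2
"""

if __name__ == "__main__":
    for device, mount_point in entries_without_nofail(SAMPLE_FSTAB):
        print(f"{device} on {mount_point} has no 'nofail' or 'noauto' option")
```

To check a real system you would pass in the contents of /etc/fstab instead of the sample string. Adding `nofail` to a data drive's options is the usual way to let the boot continue when that drive is missing.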
I'm not sure what the symptoms would have been if my SSD had failed while running Linux. My gut says it would have looked similar to your Linux dmesg output, with the boot drive's I/O disconnecting or the drive becoming inaccessible.
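If you want to see which device the kernel is complaining about, a small filter over saved dmesg output can help. This is only a sketch: the keyword list and the device-name pattern are my own guesses at what is useful, and the sample log lines are made-up illustrations in the style of typical kernel messages, not output from the original poster's machine.

```python
import re

# Device names as they commonly appear in kernel logs (nvme0, nvme0n1, sda1, ata3).
DEVICE_RE = re.compile(r"\b(nvme\d+(?:n\d+(?:p\d+)?)?|sd[a-z]{1,2}\d*|ata\d+(?:\.\d+)?)\b")

# Lower-case substrings that often accompany a dropped or failing drive.
ERROR_HINTS = ("i/o error", "link down", "controller is down", "timeout", "offline")


def suspicious_lines(log_text):
    """Return (device, line) pairs for log lines that contain a storage error hint."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if any(hint in lowered for hint in ERROR_HINTS):
            match = DEVICE_RE.search(line)
            device = match.group(1) if match else "unknown"
            hits.append((device, line.strip()))
    return hits


# Made-up sample lines, for illustration only.
SAMPLE_LOG = """\
[ 1203.112233] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 1203.445566] blk_update_request: I/O error, dev nvme0n1, sector 123456
[ 1300.000001] usb 1-2: new high-speed USB device number 5 using xhci_hcd
[ 1404.000002] ata3: SATA link down (SStatus 0 SControl 300)
"""

if __name__ == "__main__":
    for device, line in suspicious_lines(SAMPLE_LOG):
        print(f"{device}: {line}")
```

On a real system you could save the log with `dmesg > dmesg.txt` right after a disconnect (from a live USB if the installed OS has already crashed) and feed the file contents to `suspicious_lines`. If the same device name keeps showing up regardless of which physical drive is in that slot, that points toward the slot, controller, or board.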
I've also had a system with an AMD processor fail to boot, but that one wouldn't even POST. Fixed that one by finally reseating the CPU. Turns out that's a common issue with some AMD CPUs using the AM4 socket, found a lot of complaints online for that one after the fact.
Since your system runs fine from a live USB, and you've already replaced the M.2 drive, I would try running the system without any SATA drives installed, and try to force a crash until you feel confident the issue is gone.
If the problems still persist, then I would look at getting a cheap new HDD and a new SATA cable, installing a temporary OS, and running the test again.
If it STILL crashes, I would look at removing all unnecessary hardware from the motherboard and slowly testing each stage as you rebuild.
I unplugged the SATA cables last night and booted from a Windows USB to install it, and the SSD disconnected again mid-install :) The SSD somehow disconnects, and when that happens while the OS installed on it is running, it causes a crash. When running from USB, there is no crash. It's not the HDD, not the memory or CPU, and not the SSD (it's brand new). I'm down to the motherboard at this point.