Troubleshooting errorless system hang on kernel 6.7 (AMDGPU)
Edit: See the update at the end of the post for more details (no solutions, but things to consider).
As of Linux 6.7 I'm getting hard freezes that require a power cut to reset (sysrq doesn't work). They happen both at idle and under load, anywhere from 5 minutes to an hour after boot. Running journalctl --follow and dmesg -w (both as root) reveals nothing at the time of the crash. Kernel 6.6 continues to be 100% stable.
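(In case someone suggests it: it's worth checking whether SysRq is even fully enabled before writing it off, since many distros ship a sync-only default through systemd's sysctl snippets; that default is an assumption worth verifying on your own install. A quick sketch:)

    # 16 usually means sync-only; 1 enables all SysRq functions
    sysctl kernel.sysrq
    sudo sysctl -w kernel.sysrq=1

    # during a hang, try the classic REISUB sequence:
    # Alt+SysRq + r, e, i, s, u, b (unRaw, tErminate, kIll, Sync, Unmount, reBoot)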
System:
Distro/Kernel: Arch Linux 6.7.arch3-1
CPU: AMD Ryzen 5 2600X
GPU: AMD RX580 8GB via AMDGPU
RAM: Some configuration of 16GB at 2667 MT/s.
WM: SwayWM
I'm unsure how to go about properly reporting a bug if no errors are being generated.
I've spent the last 5 days or so bisecting the kernel from stable 6.6 to 6.7, while also touching on linux-next and 6.8-rc1. I've experienced hangs on every kernel after 6.6, but under different conditions. In some cases sysrq can rescue the system; other times it's a hard (still errorless) crash. I believe all of these crashes can be blamed on AMDGPU, given that all the other user reports (see the reddit thread above) mention having an AMD card.
List of similar issues
Note: Some of these are from earlier kernel versions, but they're included because they present the same way.
This bug report is the closest I can find on the issue; it links off to this report, which includes a patch. The patch in question is for 6.8-rc1 and allows the system to stay up longer, but it frequently "trips": the system stumbles and halts for tenths of a second whenever accelerated video playback in my browser (qutebrowser) has to contend with load elsewhere on the system (gaming, etc.). Hangs under this patch and kernel can be rescued. However, with video acceleration disabled in the browser (and even with the browser not running), hard crashes still occur. So either there are two new issues in play (one to do with video acceleration and the original issue described in this post), or the original issue is just manifesting in different ways.
Bisecting 6.6 to 6.7
This process was taking forever because there's no reproducible situation in which the system halts. My method was to build the kernel (about 8 minutes), then reboot and let the system idle in a Sway session while I'm off doing something else (2 to 4 hours). If I come back to a hard lock, I mark the version as bad and repeat. This method let me try 2 kernel versions a day, not nearly enough to get this fixed quickly or easily. Because of how long it takes to detect a crash, it's also possible to mark a bad commit as good (meaning it didn't crash in 2 hours, but would have in 4 or 5). I won't be continuing this.
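For anyone following along, the loop looks roughly like this; a sketch rather than my exact commands, using the v6.6 and v6.7 tags in Linus' tree and leaving the install step distro-specific:

    # mark the known-good and first-bad releases
    git bisect start
    git bisect bad v6.7
    git bisect good v6.6

    # at each step: build, install the test kernel however your distro does it, reboot
    make -j"$(nproc)"
    sudo make modules_install
    # (copy arch/x86/boot/bzImage into /boot and regenerate the initramfs here)

    # after the idle window, record the verdict and git picks the next commit
    git bisect good    # no hang after a few hours
    git bisect bad     # hard lock, power cycle needed

    # when it converges, git prints the first bad commit
    git bisect reset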
The state of AMDGPU in general
It seems I'm lucky to have not had these issues before. A little time spent reading issues like this shows that people have been encountering problems for a while now (pre-kernel 6.7). There are issues that have been open for years and still appear to affect modern systems and hardware. I'm not passing blame on anyone here, just noting that it's a miracle any of this stuff works at all, given that the hardware is so complex that even the people who spend huge amounts of time working directly with it can't fully untangle what causes these faults.
If you can't sysrq then you're down to bisecting kernel releases to find the patch that introduced the issue. You could also review any new features that are enabled by default in 6.7.
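If you go the config-review route, the kernel tree ships a small helper for diffing two configs. A rough sketch, assuming you dump each running kernel's config to the hypothetical files config-6.6 and config-6.7 (Arch kernels expose it at /proc/config.gz):

    # while booted on 6.6, then again on 6.7
    zcat /proc/config.gz > config-6.6
    zcat /proc/config.gz > config-6.7

    # scripts/diffconfig lives in the kernel source tree
    ./scripts/diffconfig config-6.6 config-6.7 | less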
I was afraid of that. Since I'm not the only one, maybe someone else is already doing it. But if it's still an issue in a few weeks, maybe I'll take it on as a weekend project. As for the motherboard, I believe the latest firmware version is currently on it (2022 or 2023).
I recently had an issue for quite some time where my computer would occasionally just hard crash. When it first started happening I tried many of the common tests, including a memory check, but found nothing. For a while it wasn't very frequent, so I just lived with it. I thought it was an OS thing, but it occurred on a different Linux distro and even on the ancient Windows 10 install I have but rarely use. I was just about to pull the trigger on replacing the mobo and maybe even the CPU and RAM. Before I did that, I followed someone's suggestion to run a memory test. I could have sworn I had already done that and it came back clean, but it was an easy enough test to run, so why not.
Sure enough, it found an error. I isolated the faulty DIMM, pulled it out, and I haven't had a crash since. Crazy, since I'm all but certain I had already run both memtest from a Linux live ISO and the Windows memory checking utility.
In short, test your RAM. Do multiple passes. Maybe even try swapping out single DIMMs and running on each for a reasonable amount of time to see if you can isolate a culprit. It was my first thought when the issue first occurred, because bad RAM is usually what causes stuff like that. When the tests came up clean originally I assumed it had to be something else. I was wrong.
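Something along these lines is what I mean; a sketch assuming an Arch setup like yours, and note that a userspace tester can't touch memory the OS is already using, so an offline memtest86+ boot is still the more thorough option:

    # quick userspace test from the running system: lock and hammer ~12 GiB for 4 passes
    sudo memtester 12G 4

    # more thorough: install memtest86+, add a boot entry for it in your bootloader,
    # then let it run several full passes overnight
    sudo pacman -S memtest86+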
This is what I'll try next. I do think memory is the problem, now that I've done a few more hours of research. Kernel 6.7 has issues with elevated RAM usage, so it's definitely doing something funky with memory that might be exposing an underlying hardware issue. I also realized my stable kernel (6.6.10) was a few point releases behind 6.6.13, so I'm running 6.6.13 now to see whether the issue was introduced late in the 6.6 release cycle, which would be easier to bisect than 6.7.
This will be my last resort, mostly because I'm fairly certain it's a kernel issue, but yes, I've never run an extended memtest on this build and should probably let one run overnight at some point just to be sure.
In the comments of the web link you shared (the link you wrote didn't work for me, but I looked up the original and am adding it here so that others can use their preferred libreddit or teddit instance), at least three comments mention that the 6.7 zen kernel works fine for them. Care to try that?
I recently had an issue with my computer freezing occasionally on Debian 12 with kernel 6.1, where no errors showed up in syslog after a forced reboot.
The way I finally found the issue was by having dmesg open on a different monitor and waiting for the freeze to happen. Just before the freeze, a number of error logs were spewed to dmesg, enough for me to catch a glimpse of the culprit: Intel WiFi.
I'm not saying that Intel WiFi is your issue. I'm suggesting you keep dmesg -w open on another monitor (if you can) and go about your normal activity until a freeze happens.
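If the machine locks up so hard that the local terminal dies with it, streaming the log to a second machine at least preserves whatever got printed. A sketch, with the hostname being a placeholder:

    # from another machine, so the output survives the freeze
    ssh root@frozen-box 'dmesg --follow --level=err,warn,info'

    # also worth checking after a reboot: kernel messages journald saved from the
    # previous boot (only helps if anything was written before the hang)
    journalctl -k -b -1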
I've never done it, but I would try reproducing this in a VM like QEMU. I would be googling at this point, but I think you can debug a kernel crash from there somehow.
I did this recently and it was extremely quick to bisect and debug, but I was lucky enough to have a simple repro that worked in the emulator.
I think if I were you I'd try to repro on bleeding edge first. Then, if it's still broken, I'd try to get the repro time down as much as possible and automate it. Then I'd bisect either in QEMU if possible, or on bare metal.
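For the QEMU route, booting the freshly built bzImage directly is quick once you have some root image to hand. A sketch, with rootfs.img as a placeholder (and with the caveat that a VM won't exercise the real AMDGPU hardware):

    # boot the just-built kernel with the serial console on stdout
    qemu-system-x86_64 -enable-kvm -m 4G -smp 4 \
        -kernel arch/x86/boot/bzImage \
        -append "console=ttyS0 root=/dev/vda rw" \
        -drive file=rootfs.img,format=raw,if=virtio \
        -nographic

    # add -s -S to pause the guest at startup, then attach gdb to the kernel
    gdb vmlinux -ex 'target remote :1234'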