After running fine for two years, my Ryzen 5950X CPU became unstable and started rebooting at random. No errors are reported, and Windows Event Viewer just shows a kernel power loss event. The distinguishing feature is that the reboots usually happen when the CPU is lightly loaded or idle, or a few seconds after leaving a heavy workload. The machine passes stress tests without an issue.
From trawling the internet, this appears to be a common problem across AMD's 3000, 5000 and (sadly) new 7000 series CPUs. There's no official acknowledgement of the problem by AMD (sound familiar?), and there doesn't seem to be a single reliable fix. The consensus seems to be that the CPU voltage drops below tolerance under light/idle workloads. So people are experimenting with many different BIOS settings to try to fix it.
With my own machine, I first had problems after installing the 5950X, which I eventually resolved by fixing the voltage for the DRAM (just taking it off 'auto' and setting it to the default value of 1.2V). It ran fine for the next two years. Then last week the battery in my UPS died, so I plugged the machine directly into the wall, and the reboots were suddenly back with a vengeance. It was unstable even on default BIOS settings.
Many BIOS fixes have been proposed, but it seems that different settings work for different people. So, I would like to document what got my machine running again and a path out of this mess:
- Initial workaround: Disable Core Performance Boost. This single change immediately stopped the reboots and gave me back a stable machine so that I could start working on a more permanent solution. But you don't want to leave it like this because you lose around 25% performance as the CPU won't boost anymore.
- Initial fix: Set Precision Boost Overdrive (PBO) to 'advanced' to grant access to the Curve Optimiser. I set an all-core positive adjustment of +3, which represents a slight voltage increase across the curve. This enabled me to re-enable Core Performance Boost while maintaining stability of the machine.
- Permanent fix (in progress): While the above works, I don't like overvolting my CPU, which also degrades performance due to heat. So, I ran the Ryzen Master utility to get suggested overclocking (= undervolting) values for each core independently. It proposed -29 on every core except one, which was -28. This scheme was not stable, but I plan to re-apply the +3 adjustment to blocks of cores to see if I can isolate the problem to a particular area. Hopefully it's just one bad core that needs extra voltage and I can undervolt the rest. Check out some videos on Ryzen Master to see how it works.
Testing to identify problematic cores
With 16 cores this is going to take some time. So I'll begin by applying +3 to half the cores, while maintaining a small negative offset or zero on the rest. If the machine crashes I'll reverse the adjustments and see if that makes it stable. If this confirms the result I'll apply +3 to half the suspect cores and so on, progressively narrowing the search. If only one core is involved, I should be able to find it after four iterations.
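The narrowing-down procedure above is essentially a binary search, and can be sketched in Python as below. This is only an illustration of the logic, not a tool that touches the BIOS: `find_weak_core` and the `is_stable` callback are hypothetical names, and the callback stands in for the manual step of applying +3 to a set of cores and waiting to see whether the machine crashes. It also assumes exactly one weak core and skips the "reverse the adjustments to confirm" step.

```python
from typing import Callable, Sequence, Set, Tuple


def find_weak_core(
    cores: Sequence[int],
    is_stable: Callable[[Set[int]], bool],
) -> Tuple[int, int]:
    """Bisect the core list to find a single weak core.

    is_stable(boosted) is a hypothetical stand-in for the manual test:
    give the cores in `boosted` the +3 adjustment, leave the rest at
    zero or a small negative offset, and report whether the machine
    stayed up. Assumes exactly one core needs the extra voltage.
    Returns (suspect_core, number_of_test_rounds).
    """
    suspects = list(cores)
    rounds = 0
    while len(suspects) > 1:
        mid = len(suspects) // 2
        first_half = suspects[:mid]
        rounds += 1
        if is_stable(set(first_half)):
            # Boosting this half fixed it, so the weak core is in here.
            suspects = first_half
        else:
            # Still crashing, so the weak core is in the unboosted half.
            suspects = suspects[mid:]
    return suspects[0], rounds


if __name__ == "__main__":
    # Simulated machine where core 11 is the weak one (made-up example).
    weak = 11
    core, rounds = find_weak_core(range(16), lambda boosted: weak in boosted)
    print(f"suspect core: {core}, found in {rounds} rounds")
```

With 16 cores the list halves as 16, 8, 4, 2, 1, which is where the four iterations come from. If more than one core is weak, the single-core assumption breaks and the search would need to track both halves.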
I hope this helps someone else solve this most infuriating problem. But I'm going to take AMD to task here - there are sooo many reports on the internet, including on AMD's own forums, that they well know about this issue. So why is there no official acknowledgement? Some guidance on what to do about it? And why has the problem been allowed to persist over three generations of CPU?
After the Ryzen Master proposal proved unstable, I re-applied a +3 adjustment to the single core that had given a slightly different result, while maintaining the -29 undervolt on all the others. The machine hasn't crashed for two days. It's too early to call it 'stable' but it is definitely a huge improvement over crashing every 15 minutes. Tentatively, it looks like my CPU has one under-performing core, although it seems too good to be true to find it this easily.
If this configuration holds up for a few more days I'll start experimenting again and see what happens if I lower the adjustment on the suspect core. Apparently each step in Curve Optimiser represents 3-5 mV, so +3 is actually a pretty small change.
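To put rough numbers on that, here is a small Python sketch that converts Curve Optimiser steps into an approximate voltage range. It takes the 3-5 mV per step figure at face value (it is a rule of thumb I've read, not an AMD specification), and `offset_range_mv` is just a name I've made up for the illustration.

```python
from typing import Tuple

# Approximate millivolts per Curve Optimiser step, as quoted above.
# This is a rule-of-thumb range, not an official figure.
MV_PER_STEP_RANGE = (3, 5)


def offset_range_mv(steps: int) -> Tuple[int, int]:
    """Return the (low, high) voltage offset in mV for a given step count."""
    low, high = sorted(steps * mv for mv in MV_PER_STEP_RANGE)
    return low, high


if __name__ == "__main__":
    for steps in (3, -10, -29):
        low, high = offset_range_mv(steps)
        print(f"{steps:+d} steps is roughly {low:+d} to {high:+d} mV")
```

On those assumptions, +3 works out to somewhere around 9 to 15 mV of extra voltage, while -29 is an undervolt of very roughly 87 to 145 mV.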
So why does this work, when making larger changes to Vcore directly doesn't? I have no idea. But I'm guessing that adjusting the voltage across the entire curve does a better job of propping up the low end under idle conditions.
I will note here that although my machine seems stable thus far on -29, the videos I watched on Ryzen Master recommended testing something more conservative like -10 first, and their presenters generally found they couldn't manage large undervolts using all-core adjustments. I'm guessing that the presence of a weaker core or two will limit all-core adjustments to whatever the weak core(s) are capable of, and that you will get better results making adjustments on a per-core basis.