My PCs run BOINC 24/7, so the CPUs are running at 100% load, 100% of the time. Each PC has an NVIDIA video card (a GeForce GTX 550 Ti and a GTX 560 SE) running CUDA. Note that the video cards are not connected to my monitor; their role is to process BOINC work units as fast as possible. (BTW, I use a 4-way KVM switch with old RGB and PS/2 connectors with 1 keyboard, 1 monitor and 1 mouse.) Windows 7 will happily run for weeks without a reboot; my dual Quad-Xeon workstation ran for almost 2 months without a restart.
And then one of my PCs starts crashing daily.
Are you crash-ready?
I always set up my Windows PCs to capture crash details. Here’s how:
- Select Control Panel > System > Advanced system settings.
- On the Advanced tab, in the Startup and Recovery section, click the Settings button.
- In the System failure section:
- Check “Write an event to the system log” (the PC’s “black box” recording of the crash)
- Uncheck “Automatically restart” (You want time to read the Blue Screen of Death.)
- Select “Small memory dump (256KB)”
- The directory “%SystemRoot%\Minidump” will be C:\Windows\minidump in most cases.
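The same settings can also be checked from a script, which is handy when there are several PCs to look after. The following is only a minimal sketch: it assumes the Startup and Recovery options are stored under the `CrashControl` registry key with the value names `LogEvent`, `AutoReboot`, `CrashDumpEnabled` (3 meaning small memory dump) and `MinidumpDir`. The function names are my own for illustration.

```python
import sys

# Assumed registry location of the Startup and Recovery crash settings.
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

# Values matching the checklist above:
# write an event, do not auto-restart, small memory dump (3).
RECOMMENDED = {"LogEvent": 1, "AutoReboot": 0, "CrashDumpEnabled": 3}


def check_crash_settings(values):
    """Return a list of problems; an empty list means the PC is crash-ready."""
    problems = []
    for name, wanted in RECOMMENDED.items():
        actual = values.get(name)
        if actual != wanted:
            problems.append(f"{name} is {actual!r}, expected {wanted}")
    return problems


def read_crash_settings():
    """Read the current values from the registry (Windows only)."""
    import winreg  # only available on Windows, so imported here

    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        for name in list(RECOMMENDED) + ["MinidumpDir"]:
            try:
                values[name], _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                values[name] = None  # value not present on this PC
    return values


if __name__ == "__main__" and sys.platform == "win32":
    settings = read_crash_settings()
    print("Minidump directory:", settings.get("MinidumpDir"))
    for problem in check_crash_settings(settings) or ["All crash settings look right."]:
        print(problem)
```

The check itself is kept separate from the registry read, so it can be tried on any machine by passing in a plain dictionary of values.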
Normally, a STOP error in Event Viewer points to the cause of failure, and the Windows Logs for Application and System can help pinpoint the time and cause of the crash. But on these occasions no STOP errors were recorded; the event logs just petered out. After a cold restart, the logs recorded only one clue: the time of the crash, in this quaint expression.
Log Name: System Source: EventLog Date: 1/03/2013 21:51:16 Event ID: 6008
Description: The previous system shutdown at 9:48:44 PM on 1/03/2013 was unexpected.
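With only the 6008 event to go on, the crash time has to be pulled out of the description text. Here is a small sketch of how that might be done. It assumes the date is day-first (1/03/2013 read as 1 March 2013), which depends on the PC’s regional settings, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Matches the time and date inside an Event ID 6008 description.
PATTERN = re.compile(
    r"previous system shutdown at (\d{1,2}:\d{2}:\d{2} [AP]M) "
    r"on (\d{1,2}/\d{1,2}/\d{4})"
)


def parse_unexpected_shutdown(description):
    """Return the crash time as a datetime, or None if the text does not match."""
    match = PATTERN.search(description)
    if match is None:
        return None
    time_text, date_text = match.groups()
    # Assumption: day/month/year order, as used on this PC.
    return datetime.strptime(f"{date_text} {time_text}", "%d/%m/%Y %I:%M:%S %p")


description = "The previous system shutdown at 9:48:44 PM on 1/03/2013 was unexpected."
crashed_at = parse_unexpected_shutdown(description)
logged_at = datetime.strptime("1/03/2013 21:51:16", "%d/%m/%Y %H:%M:%S")

print(crashed_at)              # 2013-03-01 21:48:44
print(logged_at - crashed_at)  # 0:02:32
```

The gap between the two timestamps is simply how long it took to notice the crash and cold-start the PC; the useful part is the crash time, which can be matched against BOINC or temperature logs.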
The minidump files have more information, and I use NirSoft’s BlueScreenView to analyse them. It showed a 0x00000116 STOP error, along with the files that were loaded at the time, highlighting those suspected of being the cause. It pointed to a display driver file. However, updating the drivers and performing clean installs didn’t seem to make a difference.
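When a PC is crashing daily, it helps to see at a glance when each dump was written. This is a rough sketch that lists the `.dmp` files in a folder by modification time. The folder path is the usual default mentioned earlier, the function name is made up for illustration, and reading the folder may need an administrator account.

```python
from datetime import datetime
from pathlib import Path


def list_minidumps(directory):
    """Return (modified time, file name) pairs for .dmp files, oldest first."""
    folder = Path(directory)
    if not folder.is_dir():
        return []  # no dump folder yet, so no crashes have been captured
    dumps = [
        (datetime.fromtimestamp(path.stat().st_mtime), path.name)
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".dmp"
    ]
    return sorted(dumps)


if __name__ == "__main__":
    # Assumed default location; adjust if %SystemRoot% is elsewhere.
    for when, name in list_minidumps(r"C:\Windows\Minidump"):
        print(f"{when:%Y-%m-%d %H:%M:%S}  {name}")
```

The output is just a timeline; the dumps themselves still need a tool such as BlueScreenView to interpret.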
The 0x00000116 STOP error is a bit different to your common-or-garden crash. MSDN describes it as follows: “The VIDEO_TDR_ERROR bug check has a value of 0x00000116. This indicates that an attempt to reset the display driver and recover from a timeout failed.” So it is more like a process falling asleep than crashing.
Just as I thought I had fixed it, my other PC crashed. However, since its Startup and Recovery settings were set to automatically restart and record a kernel dump, which BlueScreenView can’t read, I can only speculate on its cause. Great, 2 narcoleptic PCs.
I narrowed the causes down to:
- The video card was getting too warm.
- The video card had not been properly configured.
- The BIOS settings were wrong.
Here’s what I think worked (and has worked so far):
- The case is cramped and was a little dusty. I removed the heat-sinks from the video card and CPU and vacuumed the case clean.
- After putting the case back together, the rear system fan became very noisy. Not long after I unplugged it, the 0x00000116 error appeared. Lubricating the fan motor and reattaching it has done the trick.
- The video card had not been properly configured. The NVIDIA control panel is disabled unless the card is plugged into a monitor. Features such as PhysX were turned off and turning them on probably helps CUDA processing. Maybe.
- In BIOS I reverted to fail-safe options (I wasn’t far off standard anyway). I disabled C1E, a feature that cuts the clock rate when the CPU is not fully utilised. There are a few BIOS features that I don’t need enabled, such as virtualization, that I’ll switch off later.
I don’t know for sure if the problem has been fixed, but it looks good so far.