STOP 0x116 – If the card isn’t cooling it will STOP

I thought that I had solved the STOP 0x116 errors but it seems that I had merely delayed them.

After a spate of BSODs, I left the video card out of my main rig until I had some time for tinkering.  I use the GPU for BOINC processing, so those CUDA work units just had to wait.  When I refitted the card, the BSOD appeared within 5 minutes of starting up.  I powered down and restarted, and the same thing happened 5 minutes later.

Since the case is quite small, the double-width video card sits close to the bottom.  Had it been anywhere else, the dead fan (!) would have been obvious.  (I also had a noisy case fan that made diagnosis by ear impossible.)  I couldn’t start the fan with a flick of a finger.  I removed the card and the fan and found that indeed no reasonable amount of force would spin the fan; it was DED.

Computer shops didn’t have a direct replacement (fair enough) and the only fans they stocked were case or CPU fans.  Jaycar had some interesting units that are both quiet and move bulk air.  So I bought a 120mm case fan that I could bolt to the existing fan shroud to replace the 95mm original: quieter and with more air flow.

The first minor issue was the fan header on the card, which is somewhat smaller than a standard motherboard fan header.  No great drama, as I could use the old lead to connect the new fan to the card's socket.

The bigger issue was that there was no way to line up the case fan’s mounting holes with any solid object!  Back to Jaycar for a smaller fan.

But at least I found the reason for the STOP 0x116 errors.

Fixing a dead fan on a video card

I tried a quick repair with an 80mm fan (exchanged at Jaycar for the 120mm fan).  I tried to clip it into the existing fan shroud, but it wasn’t going to fit easily or securely.  So I took some self-tapping metal screws and carefully screwed each corner into a suitable pair of fins on the heatsink.  I directed the airflow away from the heatsink to try to draw air across the GPU and RAM.

Case fan to replace a video card fan

The case fan came with a 3-pin plug to suit a motherboard socket, which is larger than the fan socket on the video card.  I cut the old fan leads, stripped the insulation, twisted and tinned the leads and squeezed them into the new fan plug.

Case fan into a video card fan header

I plugged the repaired card back in, downloaded some utilities to confirm temperature and fan speed, and had instant success!  I even managed a BIOS update.
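
A temperature and fan check like this can also be scripted rather than read off a GUI utility.  What follows is only a minimal sketch, assuming NVIDIA's nvidia-smi command-line tool is installed with the driver and that the card reports both fields (older GeForce cards may answer "[N/A]" or "[Not Supported]" for some of them).  The function names and the 85 °C warning threshold are my own illustrative choices, not anything NVIDIA recommends.

```python
import subprocess

# Ask nvidia-smi for temperature (deg C) and fan speed (%) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,fan.speed",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as '67, 45' into (temp_c, fan_percent).

    A field the card does not report (e.g. '[N/A]') becomes None.
    """
    def to_int(field):
        field = field.strip()
        return int(field) if field.isdigit() else None

    temp, fan = line.split(",")[:2]
    return to_int(temp), to_int(fan)


def read_gpus():
    """Run nvidia-smi once and return one (temp_c, fan_percent) per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]


def needs_attention(temp_c, fan_percent, max_temp_c=85):
    """Hypothetical rule of thumb: flag a hot card or a fan reporting 0%.

    Some newer cards stop their fans at idle on purpose, so a 0% reading
    is only suspicious while the GPU is under load.
    """
    too_hot = temp_c is not None and temp_c >= max_temp_c
    fan_stopped = fan_percent == 0
    return too_hot or fan_stopped
```

Calling `read_gpus()` every minute or so while BOINC is crunching, and passing each pair to `needs_attention()`, would have flagged a seized fan long before the BSOD did.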


STOP 0x116 is annoying

What a pain.  Since 22 February one of my PCs has been locking up with no apparent error.  There was no tell-tale stop error recorded in Event Viewer to show which program had caused the crash.

My PCs run BOINC 24/7, so the CPUs are running at 100% 100% of the time.  Each PC has an NVIDIA video card (a GeForce GTX 550 Ti and a GTX 560 SE) running CUDA.  Note that the video cards are not connected to my monitor; their role is to process BOINC work units as fast as possible.  (BTW, I use a 4-way KVM switch with old RGB and PS/2 connectors with 1 keyboard, 1 monitor and 1 mouse.)  Windows 7 will happily run for weeks without a reboot; my dual Quad-Xeon workstation ran for almost 2 months without a restart.

And then one of my PCs starts crashing daily.

Are you crash-ready?

I always set up my Windows PCs to capture crash details.  Here’s how:

  1. Select Control Panel > System > Advanced system settings.
  2. On the Advanced tab, in the Startup and Recovery section, click the Settings button.
  3. In the System failure section:
    1. Check “Write an event to the system log” (the PC’s “black box” recording of the crash)
    2. Uncheck “Automatically restart” (You want time to read the Blue Screen of Death.)
    3. Select “Small memory dump (256KB)”
    4. The directory “%SystemRoot%\Minidump” will be C:\Windows\minidump in most cases.
My System Failure preferences
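
The same preferences can be checked from a script, which is handy with more than one PC to look after.  This is only a sketch: as far as I know the dialog stores these choices as the values `LogEvent`, `AutoReboot` and `CrashDumpEnabled` under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl`, with `CrashDumpEnabled` set to 3 for a small memory dump.  The function names are mine and purely illustrative.

```python
# Preferences from the steps above, expressed as CrashControl registry values.
WANTED = {
    "LogEvent": 1,          # "Write an event to the system log" checked
    "AutoReboot": 0,        # "Automatically restart" unchecked
    "CrashDumpEnabled": 3,  # 3 = small memory dump (2 would be a kernel dump)
}

CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"


def read_crash_control():
    """Read the current values from the registry (Windows only)."""
    import winreg  # imported here so the rest of the sketch runs anywhere

    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        for name in WANTED:
            try:
                values[name] = winreg.QueryValueEx(key, name)[0]
            except FileNotFoundError:
                values[name] = None  # value not present on this machine
    return values


def mismatches(values):
    """Return {name: (current, wanted)} for every setting that differs."""
    return {
        name: (values.get(name), wanted)
        for name, wanted in WANTED.items()
        if values.get(name) != wanted
    }
```

On a Windows PC, `mismatches(read_crash_control())` returns an empty dictionary when the machine is set up as described, and otherwise lists each setting that needs changing.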

In Event Viewer, a STOP error normally points to the cause of failure, and the Windows Logs for Application and System can help pinpoint the time and cause of the crash.  But on these occasions no STOP errors were recorded; the event logs just peter out.  After a cold restart, the logs record only one clue: the time of the crash, in this quaint expression.

Log Name: System Source: EventLog Date: 1/03/2013 21:51:16 Event ID: 6008
Description: The previous system shutdown at 9:48:44 PM on ‎1/‎03/‎2013 was unexpected.
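
If you want to line these crash times up against other logs, the timestamp can be pulled out of the description text.  A minimal sketch, assuming the wording shown above and a day/month/year date order as on my PCs; other regional settings will need a different format string.  The function name is made up for illustration.  Note that the date in the real message contains invisible left-to-right marks, which have to be stripped first.

```python
import re
from datetime import datetime

# Matches the time and date inside an Event ID 6008 description.
PATTERN = re.compile(
    r"shutdown at (\d{1,2}:\d{2}:\d{2} [AP]M) on (\d{1,2}/\d{1,2}/\d{4}) "
    r"was unexpected"
)


def parse_unexpected_shutdown(description):
    """Return the crash time as a datetime, or None if the text differs."""
    # Windows pads the date with U+200E/U+200F direction marks; drop them.
    cleaned = description.replace("\u200e", "").replace("\u200f", "")
    match = PATTERN.search(cleaned)
    if match is None:
        return None
    time_part, date_part = match.groups()
    # Assumes d/m/y order, e.g. 1/03/2013 is 1 March 2013.
    return datetime.strptime(f"{time_part} {date_part}", "%I:%M:%S %p %d/%m/%Y")
```

Fed the description above, this gives 1 March 2013 at 21:48:44, about two and a half minutes before the restart was logged.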

The minidump files have more information, and I use Nirsoft’s BlueScreenView to analyse them.  It shows the 0x00000116 STOP error along with the files that were loaded at the time and highlights those suspected of being the cause.  It pointed to a display driver file.  However, updating the drivers and performing clean installs didn’t seem to make a difference.

Nirsoft’s BlueScreenView.  Look at all of those 0x00000116 errors.  Look at how often they happened!
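
The "how often" question can also be answered without opening each dump.  This is a rough sketch that counts crashes per day from the minidump file names alone.  It assumes the Windows 7 naming pattern of MMDDYY-number-sequence.dmp, which is what I believe my PCs use; the function names and the example file names in the note below are made up for illustration.

```python
import re
from collections import Counter
from datetime import date

# Assumed minidump name: MMDDYY-<number>-<sequence>.dmp
NAME = re.compile(r"^(\d{2})(\d{2})(\d{2})-\d+-\d+\.dmp$", re.IGNORECASE)


def dump_date(filename):
    """Return the crash date encoded in a minidump file name, else None."""
    match = NAME.match(filename)
    if match is None:
        return None
    month, day, year = (int(part) for part in match.groups())
    try:
        return date(2000 + year, month, day)
    except ValueError:
        return None  # digits that do not form a real date


def crashes_per_day(filenames):
    """Count minidumps per calendar day, ignoring names that do not match."""
    dates = (dump_date(name) for name in filenames)
    return Counter(d for d in dates if d is not None)
```

Passing it the names of the files in the minidump folder, for example `crashes_per_day(["030113-21543-01.dmp", "030113-30217-01.dmp"])`, gives a count for each day that had at least one crash.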

The 0x00000116 STOP error is a bit different to your common-or-garden crash.  MSDN describes it as follows: “The VIDEO_TDR_ERROR bug check has a value of 0x00000116. This indicates that an attempt to reset the display driver and recover from a timeout failed.”  So it is more like a process falling asleep than crashing.

Just as I thought I had fixed it, my other PC crashed.  However, since its Startup and Recovery settings were set to automatically restart and record a kernel dump, which BlueScreenView can’t read, I can only speculate on the cause.  Great, 2 narcoleptic PCs.

Cause?

I narrowed the causes down to:

  • The video card was getting too warm.
  • The video card had not been properly configured.
  • The BIOS settings were wrong.

Solution?

Here’s what I think worked (and has worked so far):

  • The case is cramped and was a little dusty.  I removed the heat-sinks from the video card and CPU and vacuumed the case clean.
  • After putting the case back together, the rear system fan became very noisy.  Not long after I unplugged it, the 0x00000116 error appeared.  Lubricating the fan motor and reattaching it has done the trick.
  • The video card had not been properly configured.  The NVIDIA control panel is disabled unless the card is plugged into a monitor.  Features such as PhysX were turned off and turning them on probably helps CUDA processing.  Maybe.
  • In BIOS I reverted to fail-safe options (I wasn’t far off standard anyway).  I disabled C1E, a feature that cuts the clock rate when the CPU is not fully utilised.  There are a few BIOS features that I don’t need enabled, such as virtualization, that I’ll switch off later.

I don’t know for sure if the problem has been fixed, but it looks good so far.