A couple of days ago one of my team reported that one of our Windows servers had BSOD’d just as they were sliding the monitor out of the rack. This particular Server 2003 box is our main licence server here at the University. It hosts the licence managers (such as FlexNet / FLEXlm) for our numerous and varied specialist applications, and therefore has a fair number of dongles from various suppliers attached as part of the licensing mechanism. The server was set to auto-restart after a crash, so once it was back up the investigation started. I asked the same team member to check that the server seemed in reasonably good health and that the licence managers were working (it might be the Summer vacation but we still have some students and academics around the place), and I started work on establishing what had triggered the BSOD.
The event logs showed the crash but there were no entries immediately prior that would indicate something bad was about to happen. Next I grabbed the crash dump file from the machine. The server was configured to create a kernel dump so c:\windows\memory.dmp was a manageable 114MB. After loading it up into WinDbg the initial output confirmed the bugcheck code that I’d seen in the event log but running…
!analyze -v
…gave me something much more useful. At the top of the output some more info on the bugcheck was displayed:
This was the first time that I’d seen a PNP_DETECTED_FATAL_ERROR bugcheck but the output gave me some idea about what had gone on. Given that the output pointed to duplicate PDOs, I wondered if maybe one of the USB licence dongles, or the hardware it was attached to, had fallen over and then come back, causing the PnP manager to enumerate the new device before the old entry had been tidied up (maybe the PnP manager’s surprise-remove hadn’t been actioned properly? Surprise-remove is what kicks in if, for example, you pull out a USB memory stick without doing the ‘Safely Remove Hardware’ routine). It didn’t seem likely, but the next thing to do was to identify which device was showing as duplicated. After making a note of the two offending PDOs (the newly reported PDO, 88790170, and the one that it was a duplicate of, 8991ac40) I ran…
!devnode 0 1
…and searched the output to track them down. Pretty quickly I found the following:
Running !devnode 0 1 shows the devnode structure in a hierarchical format, which makes it much easier to read the results. So despite my first thoughts that maybe the crash was linked to one of the many licence managers we’ve got running on the server (along with their USB dongles), I think that I might actually be looking for a team member who has been sticking a SanDisk Cruzer USB stick somewhere they shouldn’t…
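For anyone who would rather not search a long !devnode 0 1 listing by eye, the lookup can be scripted once the debugger output has been saved to a text file (WinDbg’s .logopen command is one way to capture it). What follows is only a rough sketch, not what I actually ran: the function names are made up for illustration, and it assumes each devnode appears as a "DevNode <address> for PDO <address>" line followed by an "InstancePath is ..." line, which you should check against your own output.

```python
import re

# Assumed line shapes in a saved `!devnode 0 1` log (verify against your own output):
#   DevNode 0x8a5d4e48 for PDO 0x8a5d4f80
#     InstancePath is "ENUMERATOR\DEVICE\INSTANCE"
DEVNODE_RE = re.compile(r"DevNode\s+\S+\s+for\s+PDO\s+(\S+)", re.IGNORECASE)
INSTANCE_RE = re.compile(r'InstancePath is "([^"]*)"')


def normalise(addr):
    """Lower-case an address and drop any 0x prefix or 64-bit backtick separator."""
    addr = addr.lower().replace("`", "")
    return addr[2:] if addr.startswith("0x") else addr


def find_pdos(log_text, wanted):
    """Map each wanted PDO address to the instance paths of devnodes reporting it."""
    targets = {normalise(a) for a in wanted}
    hits = {t: [] for t in targets}
    current = None  # PDO of the devnode being read, if it is one we want
    for line in log_text.splitlines():
        match = DEVNODE_RE.search(line)
        if match:
            pdo = normalise(match.group(1))
            current = pdo if pdo in targets else None
            continue
        match = INSTANCE_RE.search(line)
        if match and current is not None:
            hits[current].append(match.group(1))
            current = None  # only take the path that belongs to this devnode
    return hits
```

With the log saved as, say, devnode.txt (a hypothetical file name), calling find_pdos(open("devnode.txt").read(), ["88790170", "8991ac40"]) would return a dictionary keyed by the two PDO addresses, each listing whatever instance paths were found for it.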