So, I bought a couple of very cheap PowerEdge 2650s off eBay. Everything was hunky dory: they were filthy, so I cleaned them out inside and spent a while updating the BIOS and firmware versions (hence the 3.5" floppy disks from my earlier post)... but last night something weird happened. I issued a shutdown -r now to reboot the machine and it never came back up.
It sat with the front power light flashing constantly; the front LED would come on flashing blue and the server name (which I had put in the BIOS) would show. But that was it: the fans were spinning like mad but there was no screen activity, nothing on the ESM NIC, nothing on the onboard network cards... just the front power light flashing on and off constantly.
I've read all over the internet about this, and everyone and their mother has suggestions... from stripping the whole machine down, to something shorting the main board out, to a dodgy PSU fitting, even as far as a known bug in the BIOS (which I have found no trace of, except their word) whereby shutting down with Ubuntu in ACPI mode results in the machine being unbootable.
However, for me, it was none of those. It was that one of the processor voltage regulators had failed. I found this out by taking out the secondary processor and then using its voltage regulator on the primary processor. Lo and behold, the machine came to life.
Now, this isn't the end of the story. I found the state the machine had been in very frustrating... the front panel lit, the green power light flashing constantly, just like it's in standby... but it's not, and that's not what it's all about. According to an ex-Dell employee whom I got on the wire earlier, it's all about sort of making sure that Dell get to sell their customers a new server now and again. I'm not knocking them for it, I'd love a brand new server and they have to drive their business somehow... however, this sort of semi-built-in problem, which server engineers know about and can spot, is a contrivance. The BIOS should be able to tell you there's a voltage error in these circumstances; after all, it was able to tell me its own name from the settings I'd put in...
So, I've set about adding and removing components. I've deliberately scored lines in the back of ECC RAM modules, broken pins off processors, cut fan wires and ruined a perfectly good 300W slide-in PSU to help you guys read all the signs of these states that look like standby but aren't reported on the front panel... so anyone wanting to give me a few quid for ruining a perfectly good server for our collective benefit can do so... please god do so.
Anyway here's the list:
Symptoms:
Front panel/bezel showing blue - normal.
Front panel power button green LED flashing off & on at a constant rhythm.
Fans on full blast for ~10 seconds then they slow to 50% speed and sit idle.
Various fan indicator lights are green (even if you remove a fan).
No screen activity at all (not even BIOS POST).
No activity on network cards.
No activity on, or access to, ESM via the rear ESM NIC.
This is what I'm going to term the zombie state, because the front panel is basically telling you that the machine is in standby mode, but everything else is saying it's hung. Even if you strip the system right down and rebuild it, if you build the one faulty component back in it'll hang again.
So, my primary recommendation is, do a slow build up from nothing... remove everything right down to the motherboard coming off the holding tray... brush everything with a paint brush to remove dust and debris, use an air duster if you have one around... clean it all carefully.
Check the motherboard for any obvious signs of damage or burning. If you see any, read no more: go buy a new motherboard.
Causes & Specific Signs:
CPU Undervolt - If processor 0 (the main CPU) has insufficient voltage, the green fan LED on the main board nearest the SCSI back board (front edge) and closest to the processor will stay off, with only one of its four quarters lighting up very dimly orange. The solution here: replace the voltage regulator.
If you can't replace the voltage regulator, remove your secondary CPU, and use its voltage regulator.
CPU Undervolt - If processor 1 (the secondary CPU) has insufficient voltage, both the green fan LEDs closest to the front edge of the main board will remain dim, with only one quarter showing a very dim flicker of orange.
Again, you must replace the voltage regulator, or remove the secondary CPU until you can.
In both these CPU voltage situations, and indeed with a dead CPU itself, you can run your 2650 on a single good CPU and regulator; at the very least, check that it now POSTs the BIOS.
Dead memory - this shows up on the front panel in orange and doesn't go into the standby-like mode.
Dead PSU - a dead PSU can cause this issue, but only if it's connected and dead and the other PSU is slightly off its connection. I could get the machine to go into the zombie standby-like state just by having two powered PSUs connected but not firmly pushed home. Push one or both PSUs home and it'll spring to life.
+12 volt battery failure - this gets shown on the front panel LED in orange.
RAID memory module damage - if you have the RAID key and the RAID battery installed and the memory module you have for RAID memory is defective, you can't really get it to hang in the standby-like mode; however, you do end up stuck when the RAID BIOS tries to load. I didn't see any failure warning shown in orange, so unless you've got a screen directly wired into your 2650 you're not going to know what's going on.
RAID key not inserted fully - this can cause the system to sit in the standby idle state. The symptoms look the same, but the difference here is that all four internal fan LEDs around the CPUs are lit green, even if there are no fans installed! So, make sure your RAID key is pressed firmly into place, and both blue side holders are clicked up into the vertical to hold it in place.
Don't forget while you're pulling things in and out of your machine that if you leave the PCI riser card up a fraction, it'll stop your machine powering on... this is not zombie mode... this is where not even the fans come on... so press the PCI riser cards down into place (with the blue lever on the left of your chassis).
The next failure is of the front panel itself... if it has failed you'll need to look at the rear light only, and that rear light won't give you the green power LED clue about the zombie mode. What will happen in zombie mode is that the machine will show blue, maybe flashing blue, on the rear and that is all.
In this situation, the approach I've found best is to uninstall both processors, or all the RAM, and fire up the machine by applying power. You will then get a definite orange on the rear - this tells you that your machine is not in zombie mode, it's just stuck in the boot somewhere. If, however, after removing all the RAM and both processors the machine still shows blue on the back... it's in zombie mode... so start to strip out and check the processors.
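The signs listed above can be collapsed into a rough lookup, sketched here in Python as a memory aid. The class, field names and string values are all made up for illustration (nothing here talks to the server), and the checks only encode what is described in this post, so treat it as a starting point rather than a definitive diagnosis.

```python
from dataclasses import dataclass


@dataclass
class Signs:
    """Observed state of the box; all field names are hypothetical."""
    fans_spin: bool           # do any fans come on when power is applied?
    panel_orange: bool        # front panel showing orange instead of blue
    power_led_flashing: bool  # green power button LED flashing steadily
    cpu_fan_leds: str         # "all_green", "cpu0_led_off" or "front_pair_dim"


def likely_cause(s: Signs) -> str:
    """Map the signs described in this post to the first thing to check."""
    if not s.fans_spin:
        return "PCI riser not pressed down (not zombie mode)"
    if s.panel_orange:
        return "reported fault such as dead memory or battery (not zombie mode)"
    if not s.power_led_flashing:
        return "not zombie mode: connect a screen and watch the boot"
    if s.cpu_fan_leds == "cpu0_led_off":
        return "processor 0 voltage regulator"
    if s.cpu_fan_leds == "front_pair_dim":
        return "processor 1 voltage regulator"
    if s.cpu_fan_leds == "all_green":
        return "RAID key not fully seated"
    # The post does not say what the fan LEDs show for loose PSUs,
    # so that case falls through to the generic advice.
    return "unknown: check PSU seating, then rebuild one part at a time"


print(likely_cause(Signs(True, False, True, "cpu0_led_off")))
```

The order of the checks matters: the "not zombie mode" cases are ruled out first, and only then are the internal fan LEDs used to tell the zombie causes apart.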
Top tips:
- Look for screws, debris or dust shorting the mainboard.
- Look for damage to the mainboard.
- Make sure your PSUs are firmly inserted.
- Make sure your voltage regulators are both working.
- If in doubt go down to a single CPU, then try the other CPU, then try the other voltage regulator.
- Seat your RAID Key.
- Seat all the RAM and the RAID RAM.
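The single CPU swap-around from the tips above is really just trying every CPU and regulator pairing one at a time and noting which ones POST. Here is a minimal Python sketch of that bookkeeping; the part labels ("CPU A", "VRM A" and so on) and both function names are made up for illustration, and it assumes any part that appears in at least one working pairing is good.

```python
from itertools import product


def single_cpu_trials(cpus=("CPU A", "CPU B"), regulators=("VRM A", "VRM B")):
    """List every one-CPU, one-regulator pairing to try in the primary socket."""
    return list(product(cpus, regulators))


def suspects(results):
    """results maps (cpu, regulator) -> True if that pairing reached BIOS POST.

    Any part seen in a working pairing is cleared; whatever is left is suspect.
    """
    cleared = set()
    for pair, posted in results.items():
        if posted:
            cleared.update(pair)
    tried = {part for pair in results for part in pair}
    return sorted(tried - cleared)


# Example: every pairing that uses "VRM A" fails, the others POST.
results = {
    ("CPU A", "VRM A"): False,
    ("CPU A", "VRM B"): True,
    ("CPU B", "VRM A"): False,
    ("CPU B", "VRM B"): True,
}
print(single_cpu_trials())
print(suspects(results))  # ['VRM A']
```

If nothing POSTs at all, every part comes back as a suspect, which is the cue to look elsewhere (PSU seating, RAID key, the board itself).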
And if in doubt, contact me.