Thank you to everyone who took the time to respond; very helpful and informative.

Original Question:

>I have a number of Sun 280R machines that I'm currently installing for a client, and I got asked the question about what happens when a
>CPU fails.
>
>I guess one of the following may happen:-
>
>1) Server Reboots with a single CPU
>2) Server Dies
>3) Server Continues running with a single CPU
>
>The rough configuration of each machine:-
>
>280R with 2x1.2 CPU/ 2GB RAM / 2x73gb Mirrored disks / Dual PSU's
>Solaris 2.8 / Patched to April 03

Additional Option I missed out:

4) Server reboots, runs until the failure occurs again, and again, and again...

Short Answer:

The general consensus is that any of the above can occur, though the nature of the failure (consistent or intermittent) plays a large part in what happens in real life.

Most types of CPU or CPU cache failure will result in a panic restart; if the failure is on the primary CPU it may render the machine useless (1 reported). If the failure is consistent, the OBP will mark the CPU as failed and restart the machine with the other good CPU(s). If the failure is intermittent, the machine may not know the CPU is a problem and will restart with it running; this will result in further panic restarts until the problem is resolved.

Intermittent failures look like a pain, so it's worth scanning through the logs, crash dump analysis etc. if your machine is restarting intermittently.

Additional Information:

Randomly selecting some of the reports on what happens in detail:-

A CPU error detected by the OS will almost certainly cause a panic and core dump. Portions of the kernel may be cached on the CPU, so it gives up. After the core dump, the hardware is reset and goes through a small POST procedure. If the error is persistent, then the OBP will detect the failure and will not provide the CPU to the OS for use.
If the error was transient, then the OBP will not know about the failure, and the reboot will be similar to the last time.

>>

What I've seen on other servers (of any architecture, not just Sparcs) is that typically a CPU will not "die" outright - it'll partially (and often, only intermittently) fail and make the machine unstable until you diagnose the problem and either fix or disable the CPU.

>>

It varies depending on the nature of the fault. The most probable course of events is a crash followed by an attempt to restart using the one good CPU. The success of this will be decided by the nature of the initial failure. The 280R is marginally better in this respect than say the 240/480/880 series of machines as it shares some architectural features with the higher end 4800/6800 systems.

>>

Server reboots about twice a day with both CPUs but, in my case, the CPU was not completely dead, just defective. I had to turn off the defective CPU manually and, until the CPU was replaced, I was able to run the machine with only one CPU.

>>

Around here, the server panics, reboots, panics, reboots, pa....

>>

I had a Netra 20 with 2 CPUs. The error messages (as I now remember them) referred to a dcache error, then said something about CPU failure, and then the machine would go back to the 'OK' prompt. I do not know if that is the standard behavior on Sun machines when there is a CPU problem on a computer with multiple CPUs, but that is the experience I had.

>>

I had the misfortune of a CPU failure on a Blade 2000. The Sun rep was saying that the 280R and 2000 have the same motherboard, so I think that you would find the same results. The system would not boot at all. On power up the fans would spin and the drives would spin up. That's about it. The system would never go green no matter how long it sat in that state.

>>

The system would panic and reboot itself. Within 15 or 20 minutes of coming up to login it would panic again and reboot again.
It always listed the same CPU as the problem in messages, so I removed the listed CPU and the system stabilized. This was at 0300 Saturday morning, so I waited until later that day and had Sun come out with a new CPU. As soon as we put in the second CPU the system became unstable again. We moved CPUs around, and it didn't matter which slot a CPU was in; it still kept rebooting. The Sun tech had to send the crash files to his backline and they determined the problem might be the motherboard. We replaced the motherboard on Sunday and the system has been stable since.

>>

"Fails" is a surprisingly vague word, at least for Sun. :-) Here are the scenarios.

1) If a system detects a problem with a CPU that can be recovered from, then it will offline the CPU and continue on the remaining one. This is theoretical, and I've NEVER seen it happen.

2) If something happens (bad CPU or memory) that leaves the system unable to guarantee the integrity of the system (i.e. OS and kernel), it will panic and reboot. Now...

2a) If the hardware is detected as faulty during the pre-boot diagnostics, it will offline the suspect processor and boot on only one CPU.

2b) If the problem was transient or at least not a really consistent error, then the system will boot with both processors active. If you do have a bad processor, then it will probably panic and reboot again before long.

If you have a bad processor but the system isn't offlining it on panic/reboot, then you can force it off from the boot prom (which masks it entirely from the OS), or offline it from the OS level (which _mostly_ protects you from using that processor--but not entirely, especially if it's CPU 0).
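
As a rough illustration only, scenarios 1, 2a and 2b above can be written down as a small decision function. The names used here (CpuFault, Outcome, expected_outcome) are made up for this sketch and do not correspond to any Solaris or OBP interface. Real behaviour depends on the hardware and on the fault, and this does not cover the case where the machine will not boot at all.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Labels are illustrative, not messages produced by any real system.
    CONTINUE_DEGRADED = "CPU offlined, system keeps running on the remaining CPU"
    REBOOT_DEGRADED = "panic, then reboot with the suspect CPU offlined"
    REBOOT_LOOP = "panic, reboot with all CPUs active, probably panic again"


@dataclass
class CpuFault:
    # Hypothetical description of a fault, for illustration only.
    recoverable: bool      # can the OS recover without risking system integrity?
    caught_by_post: bool   # do the pre-boot diagnostics detect the fault?


def expected_outcome(fault: CpuFault) -> Outcome:
    """Map a fault to the scenario numbers used in the report above."""
    if fault.recoverable:
        return Outcome.CONTINUE_DEGRADED   # scenario 1 (reported as theoretical)
    if fault.caught_by_post:
        return Outcome.REBOOT_DEGRADED     # scenario 2a
    return Outcome.REBOOT_LOOP             # scenario 2b


if __name__ == "__main__":
    # An intermittent fault that the diagnostics miss: scenario 2b.
    print(expected_outcome(CpuFault(recoverable=False, caught_by_post=False)).value)
```

The point of the sketch is that the two questions (can the OS recover, and do the diagnostics catch it) decide the outcome, which is why intermittent faults are the ones that end up in a reboot loop.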
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

BTW: Perhaps I spoke too soon; the dual-CPU 250 in my office has started throwing wobblies and rebooting intermittently. Guess what it's reporting in the logs: EBP CPU problem :O

/me calls Sun support

Regards

Jonathan

Received on Mon Sep 22 16:57:33 2003