Sorry, I realized I never summarized this topic.

The subject was (June 2004):

=====================================================================
Hi all,

I was just wondering whether any of you have experienced similar problems with V440 machines.

One of my customers has several V440s, and one of them has already rebooted unexpectedly three times in a month. Nothing can be found in /var/adm/messages: everything looks fine, with no warnings or errors, and from one line to the next I suddenly have the beginning of a reboot!? I cannot find anything in the other log files either that points to a potential serious problem or warning.

The system is currently part of a test cluster. When I came in this morning, a service group had failed over during the night, and the logs tell me the node went down.

Jun 21 11:59:55 sam gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 Port h gen 31df4d membership 0123
Jun 23 20:00:37 sam genunix: [ID 540533 kern.notice] ^MSunOS Release 5.9 Version Generic_112233-12 64-bit

The customer told me they have already had some unexpected reboots on V440s on site at some of their own customers. Seldom, however.

So, have any of you experienced similar behaviour?

The cluster has other V440s with identical configurations (hardware, firmware, patches and software), but only one V440 is unstable, and the reboots, even if seldom, appear to be random (so far I have not been able to identify any pattern).

The V440 has 4 CPUs (1062 MHz), 8GB RAM, a SAN connection (using QLogic 2300 cards), 4x36GB internal disks and OBP 4.13.0, and as far as I can see everything looks fine.

/Pascal
=====================================================================

Solution, or let's say explanation:

=====================================================================
As a few of you mentioned at the time, the problem was thought to be due to a bug on the V440 machines. You were right.
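As an aside for anyone who wants to check their own logs for the same symptom (a boot banner appearing "from one line to the next"), here is a minimal sketch that is not part of the original thread. It assumes that a clean shutdown leaves a "syslogd: going down on signal" line in the messages file before the next boot banner; the function name and that marker string are my own choices for illustration, so adjust them to what your systems actually log.

```python
import re

# Boot banner written by genunix at every boot, as in the excerpt above.
BOOT_BANNER = re.compile(r"genunix: .*SunOS Release")
# Assumption: an orderly shutdown logs this syslogd line before the next banner.
CLEAN_SHUTDOWN = "syslogd: going down on signal"


def find_unexpected_boots(lines):
    """Return (previous_line, banner_line) pairs for boots with no shutdown logged.

    A banner that is the very first line of the input is skipped, because
    nothing can be said about what happened before it (e.g. after log rotation).
    """
    suspects = []
    saw_shutdown = False
    previous = None
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_BANNER.search(line):
            if previous is not None and not saw_shutdown:
                suspects.append((previous, line))
            saw_shutdown = False  # start watching for the next shutdown
        elif CLEAN_SHUTDOWN in line:
            saw_shutdown = True
        previous = line
    return suspects


# The two lines quoted in the original question, used as sample input.
sample = [
    "Jun 21 11:59:55 sam gab: [ID 316943 kern.notice] GAB INFO V-15-1-20036 "
    "Port h gen 31df4d membership 0123",
    "Jun 23 20:00:37 sam genunix: [ID 540533 kern.notice] ^MSunOS Release 5.9 "
    "Version Generic_112233-12 64-bit",
]

for before, banner in find_unexpected_boots(sample):
    print("last line before boot:", before)
    print("boot banner:          ", banner)
```

To run it against a real file, pass an open file object instead of the sample list, for example open("/var/adm/messages", errors="replace"). A hit only means no shutdown marker was logged before the boot; it does not by itself point to the motherboard issue.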
I have been talking with Sun over the past few months, and the problem is related to the V440 motherboards. Before going into details: there is no patch for this problem; the motherboard has to be replaced. New boards are now expected within a few weeks.

Let me now provide you with more information. There is an internal Sun document, "InfoPartner Document FIN Doc ID: I1099-1", which describes the problem in detail; it has been known for a while now. I only got a copy of the document (5 pages), so I cannot just cut and paste.

Synopsis:
Sun Fire V440 and Netra 440 systems using a specific networking configuration may unexpectedly reset.

Platform: A42 and N42, model ALL.

Part numbers affected:
540-5919-XX, FRU, ASSY, Motherboard, Netra440
540-5418-XX, ASSY, Motherboard W/CPU cage, CHLPA

BugID: 5039862

Problem description:
In an extremely limited number of applications, and with a single system configuration, the Sun Fire V440 or Netra 440 system may experience an unexpected reset and will reboot.

The specific configuration which triggers this situation is as follows:
Some or all of the data being transferred is transported via the first onboard ethernet interface "ce0" (Cassini ASIC).

When this issue occurs, the system will reset and an error message appears on the console. The system then reboots. No core files are generated and the reset output will not be logged to the /var/adm/messages file.

If it is suspected that the V440 is experiencing this issue, change the OBP variables as follows to provide more verbose output on the next failure:

diag-switch? true
post-trigger none
obdiag-trigger none

Corrective action:
There is currently no permanent resolution. Customer sites experiencing this issue should use the workaround procedures provided below. A long-term corrective action plan is being developed by Sun and will be delivered via Sun's service team.
- Use only the second onboard network interface, "ce1" (net1)
OR
- Install a PCI ethernet card in any available PCI slot. The following Sun card is tested and supported as a workaround for full gigabit network replacement functionality: X1150. Another tested and supported card, but without gigabit support, is the X2222A.

To ensure that "ce0" (net0) is never accessed inadvertently in a manner that could trigger this issue, it is highly recommended that the "ce0" interface be completely disabled. Because of Solaris instance numbering, it is also recommended that this be done after the initial Solaris installation, to ensure that net1 is assigned the "ce1" instance instead of "ce0".

To completely disable "ce0" (net0) on the system, use the following commands to install an NVRAM script at the OBP "ok" prompt:

ok nvedit
0: probe-all install-console banner
1: " /pci@1c,600000/network@2" $delete-device drop
2: (type Ctrl-C to exit nvedit)
ok nvstore
ok setenv use-nvramrc? true
use-nvramrc? = true
ok reset-all

After the system resets, "ce0" should no longer be visible to OBP (i.e. you should not see a path to "ce0" (/pci@1c,600000/network@2) when you run "show-devs" from OBP). The ce0 device should not be seen by Solaris either (i.e. in prtconf or prtpicl output).

Anyway, if you experience this issue, contact Sun and proper action will be taken.

Sorry for summarizing so late, but better late than never.

Regards,
/Pascal
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Tue Nov 16 15:14:53 2004