My thanks go to:

Octave Orgeron <unixconsole@yahoo.com>
Paul Keller <pkeller@cisco.com>
Kailashnath Rampure <kailash@tivo.com>
"Mike's List" <mikelist@sky.net>
Hichael Morton <mh1272@yahoo.com>
Serguei Borkov <sborkov@yahoo.com>
"Troy Abernathy" <tabernathy@r2-tech.com>
mike.salehi@kodak.com
Tim Chipman <chipman@ecopiabio.com>
"Miller Sutfin" <millersutfin@earthlink.net>
"Karl Vogel" <vogelke@dnaco.net>

Many think that this is a hardware issue related to a faulty design of the faster chips (400 MHz or faster), aggravated by heat. The "bad" combination seems to occur on servers with multiple 400/450 MHz CPUs (Sun Cluster?) in a server room that is not very cool or not well ventilated. Replacing the defective CPU usually fixed the problem.

Octave Orgeron seems to summarize it best:

" ...
Your question about the E-Cache Parity problem on the 400Mhz USII CPU's is a fun question to answer. The problem came from a manufacturing error from Solectron, who assembles the CPU module, TI makes the CPU. They used some sub-standard cache modules from IBM that caused issues and the thickness of the PCB was not right. As a result, anything from heat to radiation could cause the E-Cache error.

There are two paths to fixing this.. one is install the patch, that disabled the E-Cache.. this causes *serious loss* in performance. The other path is to get a replacement CPU module, make sure that it's built in Canada, those are the good CPU modules.. it'll say "Made in Canada" on the side:)
"

Paul Keller <pkeller@cisco.com> indicates that Sun has a new mirrored E-cache that addresses this particular E-cache problem:

"...
I feel your pain. Sun eventually came out with a 400MHz processor that came with mirrored eCache. That seems to have helped the 400s .... But, we've been seeing a lot of the same problems with the 440s that run on the Netra hosts. To my knowledge, they haven't dealt with that yet.
"

Serguei Borkov <sborkov@yahoo.com>:

Had it before, done 2 things: replaced CPU, and rearranged environment to run at about 35C on CPU. Problem seemed to be gone.

And according to Troy Abernathy <tabernathy@r2-tech.com>, a Sun reseller:

Based on my understanding from my engineers and clients, the revision 501-5661 and above for the 400 MHz processors eliminated the problems. As far as using a patch, I am pretty sure that there is no such fix. He can either check all of his CPU's to see what their part numbers are, or have them replaced. That can be a very expensive task though. I am a reseller and I have the CPU's listed for $995. If he has many to replace that could be painful. If he needs additional help or would like to speak with one of my engineers let me know and I will hook them up. Good luck.

Tim Chipman <chipman@ecopiabio.com> provides yet another perspective on this problem: Sun will not replace the CPU until the same chip crashes a couple of times (now try that on a critical production Sun cluster with many CPUs). There is no way to predict this in advance (even with SunVTS diagnostics).

" ...
The "solution" from sun (in my experience): If a single CPU has more than 2 hits of the e-cache error, the part is considered flawed and is replaced.

AFAIK, there is nothing in software that can be done. The later rev of the CPU is a model with "mirrored E-Cache", and isn't prone to this fault. However, I don't think there is any fix for older (affected) parts which are showing symptoms - other than replacement.

Even more fun, I'm not aware of any way to detect the potential for the problem other than wait for it to strike. We have a 3500 here (4 CPUs) and endured numerous e-cache related crashes, because different CPUs were doing it. Sun wasn't willing to replace any parts until a single specific CPU showed "multiple failures" ... I can't imagine what somebody with an 8-cpu (16?)
cpu system would do - simply wait patiently until 16 or more e-cache related crashes happen, and then force sun to replace all the parts en masse ... ?

Clearly, I am not so amused with the entire issue. If Sun was doing the "right" thing, they should pro-actively replace all old chips at risk which are still in service rather than waiting for people to suffer repeated crashes and then get the replacement. However, the chances of them doing this ... seem minimal.
"

Surprisingly, a couple of people reported that Sun's "memory scrubber patch" seems to work for some E-cache problems and some CPUs. So far I have yet to see any FCO (field change order) from Sun.

This problem reminds me of the sticky head problem on the Pro Quantum 100 MB disk drive or the intermittent problem with the Vixel fiber module.

Tony

Received on Tue Apr 23 20:30:44 2002