Hi managers,

Thanks to all of you who replied, too many to list here! I got all
kinds of suggestions, all of them very informative, so I decided to
call Sun. Sun's explanation is very similar to Colin Bigam's email, so
he's the winner! They did mention the "replace the DIMM after 3
persistent memory errors in 24 hours" rule.

Thanks!!

Meg

--- Colin Bigam <colin.bigam@west.gecems.com> wrote:
> Hi Meg;
>
> We see errors like this quite often on various systems, especially
> on the newer sunfires.
>
> Memory errors come in three types: Intermittent, persistent, and
> sticky. When the computer detects a single-bit memory error, it
> will go and refetch the data from memory. If it's correct the second
> time, then the error was transient, occurring randomly in the path
> from memory to CPU. If it's still in error, then the error is
> persistent--the system recalculates and rewrites the data back to
> memory. Then it reads it again--if the read-after-rewrite is still
> in error, then the error is labelled sticky.
>
> Sticky errors indicate bad memory, and the DIMM should be replaced.
> Sun's recommendation for persistent errors is to replace the DIMM if
> you get more than three persistent errors in 24 hours, or if there's
> a steady trend of increasing persistent errors. Transient errors
> are almost completely random (usually caused by cosmic rays!), and
> are not serious unless you start to get them steadily, in which case
> you probably have a bad system board.
>
> There are also numerous patches which apply to memory errors. Make
> sure you have a fairly recent patch cluster for your OS version on
> the box, especially making sure that you have the 'memory scrubber'
> patch. (You can search for this on sunsolve.sun.com.)
>
> Hope this helps,
> Colin
> --
> | Colin Bigam, Senior Unix analyst
>
> ----- Original Message -----
> From: Meg Wall <meg991@yahoo.com>
> Date: Monday, April 7, 2003 11:32 am
> Subject: Hardware failure?
>
> > Hi managers,
> >
> > I just got the following messages. Do I have a
> > hardware failure here? What do these mean? Thanks!!
> > I will work on my summaries today.
> >
> > Meg
> >
> > Apr 7 12:12:02 server32 pcipsy: [ID 854591 kern.info]
> > NOTICE: correctable error detected by pci0 (upa mid
> > 1f) during
> > Apr 7 12:12:02 server32 DVMA read transaction
> > Apr 7 12:12:02 server32 pcipsy: [ID 750218 kern.info]
> > AFSR=40230000.7f800000 AFAR=00000000.1c9d0a58,
> > Apr 7 12:12:02 server32 double word offset=3,
> > Memory Module U0701 id 31.
> > Apr 7 12:12:02 server32 pcipsy: [ID 916270 kern.info]
> > syndrome bits 23
> > Apr 7 12:12:02 server32 SUNW,UltraSPARC-II: [ID
> > 354824 kern.info] [AFT0] errID 0x0003cc92.01cb251e
> > Corrected Memory Error on U0701 is Intermittent
> > Apr 7 12:12:02 server32 SUNW,UltraSPARC-II: [ID
> > 376402 kern.info] [AFT0] errID 0x0003cc92.01cb251e ECC
> > Data Bit 33 was in error and corrected
> > _______________________________________________
> > sunmanagers mailing list
> > sunmanagers@sunmanagers.org
> > http://www.sunmanagers.org/mailman/listinfo/sunmanagers

Received on Mon Apr 7 23:15:34 2003
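
For anyone who wants to script the triage Colin describes, here is a
minimal Python sketch. It is only an illustration of the logic in his
email, not anything Sun ships: the function names, the boolean inputs,
and the (timestamp, kind) event list are all hypothetical. It treats
"more than three persistent errors in 24 hours" as a sliding window and
does not attempt the "steady trend of increasing errors" check.

```python
from datetime import datetime, timedelta


def classify_error(refetch_ok, reread_after_rewrite_ok):
    """Classify a corrected single-bit error (hypothetical helper).

    refetch_ok: the data was correct when refetched from memory.
    reread_after_rewrite_ok: the data was correct when read again
        after the corrected value was written back.
    """
    if refetch_ok:
        return "intermittent"   # transient, random error on the path
    if reread_after_rewrite_ok:
        return "persistent"     # rewrite cleared the bad bit
    return "sticky"             # still wrong after the rewrite


def should_replace_dimm(events, window=timedelta(hours=24), limit=3):
    """Decide whether one DIMM should be replaced (assumed policy).

    events: list of (datetime, kind) tuples for a single DIMM.
    Returns True on any sticky error, or on more than `limit`
    persistent errors inside any `window`-long span.
    """
    if any(kind == "sticky" for _, kind in events):
        return True
    times = sorted(t for t, kind in events if kind == "persistent")
    start = 0
    for end, t in enumerate(times):
        # Drop errors that fall outside the window ending at t.
        while t - times[start] > window:
            start += 1
        if end - start + 1 > limit:
            return True
    return False
```

The single intermittent error on U0701 in the log above would classify
as "intermittent" and would not trigger a replacement under this rule.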