[Summary] UltraSPARC II cache problem?

From: Manesh, Nasser (CAP, PTL, Emplifi, Consultant) (Nasser.Manesh@penske.com)
Date: Thu Sep 07 2000 - 08:12:06 CDT


Dear everybody,

Thanks a lot for all the replies I received, especially since those who
replied tried to share their experiences on this "vague" issue with me to
the best extent they could.

Here is a summary of what I received:

1. The cache problem exists.
2. The issue is exaggerated by the media (well, as always!)
3. Most of the reported cases are 400MHz UltraSPARC II CPUs with 8MB
external cache. I also received the chip vendor name (several vendors make
these chips for Sun), but I hesitate to mention it here to avoid discussions...
4. I had _one_ case reported by one of you gentlemen on 300MHz CPUs used in
an Ultra2 box, so I am not sure if it was the same problem or not. The
symptoms are similar.
5. The symptoms are crashes and reboots without warning, with messages like:

CPU5 UE Error: Ecache Copyout on CPU4: AFSR 0x00000000 01004000 AFAR 0x00000001 fd2c0ef0

6. Nobody reported a known test procedure to determine whether a system is
affected.
7. Generally it seems to be seen on the larger series like the 5500 and 6500,
but nobody is sure about it. Sun says it does not affect departmental series
like the 220R and 450. Personally, I had a 450 with the very same symptoms four
months ago, and they came over and changed the CPU module. Fixed.
8. Everybody is in agreement that if a server runs without crashes for a
week, it is not affected.
9. [LAST MINUTE UPDATE] I received this right before posting my summary.
This one involves 4MB cache modules with revision <= 50. FYI, the Ultra60 is
exactly the same board as the 220R. But this case involves _clone_ systems, and
I don't know whether their CPU modules are different or not:
We have had four out of about 15 Ultra 60 clones crashing with ecache
errors. They are all equipped with 400MHz/4MB modules. Currently we are
in the process of exchanging all CPUs in the Ultra60 with new ones.
The vendor of the machines told us that there was a bug in the modules
with revision <= 50 (the revision number is printed on a sticker attached
to the module). The exchange modules have revs > 50.

After reading the computerworld article you referenced I wonder if this
is really a fix for the problem since the rev50 modules were shipped in
summer last year. All of our more recent modules have a rev. >50. So
in theory the problem should only exist with older machines. On the
other hand none of the machines where we exchanged CPUs failed again
afterwards.

The symptoms of our problem were log messages of Ecache SRAM Data Parity
Errors followed by a crash of the machine. After rebooting them, they
usually ran fine for a couple of days before they again crashed.
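
The two message types quoted above can at least be searched for after the
fact. The following is a minimal Python sketch, not a test for affected
hardware (none was reported, see item 6). The names `scan` and `_to_int` are
made up for illustration, the patterns are copied from the messages in this
summary and may not match the wording on other systems, and treating the two
AFSR/AFAR words as the high and low halves of one 64-bit value is an
assumption.

```python
import re

# Patterns copied from the two message types quoted in this summary.
# Assumption: other systems or OS releases may word these differently.
UE_COPYOUT = re.compile(
    r"CPU(?P<reporter>\d+) UE Error: Ecache Copyout on CPU(?P<source>\d+): "
    r"AFSR (?P<afsr>0x[0-9a-fA-F]+(?: [0-9a-fA-F]{8}\b)?) "
    r"AFAR (?P<afar>0x[0-9a-fA-F]+(?: [0-9a-fA-F]{8}\b)?)"
)
PARITY = re.compile(r"Ecache SRAM Data Parity Error", re.IGNORECASE)


def _to_int(words):
    """Join space-separated hex words (assumed high then low) into one int."""
    return int(words.replace("0x", "").replace(" ", ""), 16)


def scan(text):
    """Return one record per ecache-related message found in text."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    flat = " ".join(text.split())
    hits = []
    for m in UE_COPYOUT.finditer(flat):
        hits.append({
            "kind": "ue_copyout",
            "reporting_cpu": int(m.group("reporter")),
            "source_cpu": int(m.group("source")),
            "afsr": _to_int(m.group("afsr")),
            "afar": _to_int(m.group("afar")),
        })
    for _ in PARITY.finditer(flat):
        hits.append({"kind": "sram_parity"})
    return hits
```

To use it, read the system log (on Solaris typically /var/adm/messages, but
check your syslog configuration) and pass its contents to scan(). An empty
result only means no such message was logged, not that the module is good.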

I hope I'm not missing anything. If you have other new findings to add
please email me, and if I receive a bulk of new info I'll post an update to
this summary. Thanks again.

ORIGINAL POST:

Hello everybody,

Has anybody experienced this "cache problem" on UltraSPARC chips that is
floating around in the media?

Look at:

http://www.computerworld.com/cwi/story/0%2C1199%2CNAV47_STO49485%2C00.html?am

I need to know if this is something widely experienced or just an
exaggeration by the media, since we're going to production in a relatively
short time with a bunch of Sun boxes, and we cannot afford downtime caused by
this -- if it is really an issue.

If you have experienced it, please let me know:

- What the problem really is
- What the symptoms are
- Whether there is any specific test one can do on his/her own systems to make
sure whether they are affected or not

I'll summarize, and I believe it's to the benefit of most of us to know the
details about it.
 

> Nasser K. Manesh
> UNIX System Administrator/Webmaster
> Penske Technology Services
> Email: nasser.manesh@penske.com
> Voice: (610) 796-6527
> Fax: (610) 796-4387



