Memory problems on E450

2007-12-24 22:19:00

Hi Gurus,
I have a strange problem with one of our E450s. About a month ago I
started getting the following errors in /var/adm/messages, stating that
Memory Module 1904 was experiencing memory problems:

foo unix: [ID 908439 kern.notice] [AFT0] Multiple Softerrors:
foo unix: [ID 356634 kern.notice] 3 Intermittent, 253 Persistent, and 0
Sticky Softerrors accumulated
foo unix: [ID 340762 kern.notice] from Memory Module 1904

That seemed a straightforward error, and I requested that a Sun engineer
come and change the module. When he arrived he moved a known good module
into slot 1904 and placed the new module in 1901 (this was done to ensure
that it wasn't the slot that was causing the problem). This seemed fine, and
we booted the machine up again and ran the SunVTS stress test.
The same errors occurred again, but this time they were coming from
1901. We naturally thought that the DIMM was bad and replaced it again,
this time placing 1804 into 1901 and the new DIMM in 1804 (this was done to
rule out a faulty bank holding the 190x DIMMs).

We booted up again and all seemed fine. SunVTS passed with no errors, and we
left it at that. A day later, though, the problems started again, this
time from 1804. However, the error messages were somewhat different:

foo pcipsy: [ID 758641 kern.info] AFSR=40830000.a4800000
AFAR=00000000.d0610fa8,
foo double word offset=5, Memory Module 1804 id 4.
foo pcipsy: [ID 553544 kern.notice] syndrome bits 83
foo pcipsy: [ID 865758 kern.warning] WARNING: correctable error from pci0
(upa mid 4) during
foo DVMA read transaction

as well as:

foo unix: [ID 908439 kern.notice] [AFT0] Multiple Softerrors:
foo unix: [ID 356634 kern.notice] 3 Intermittent, 253 Persistent, and 0
Sticky Softerrors accumulated
foo unix: [ID 340762 kern.notice] from Memory Module 1804
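
For anyone wanting to track which module the errors follow after each swap,
here is a minimal sketch in Python that tallies how many log lines name each
memory module. It assumes the messages look exactly like the ones quoted
above; the function names are my own, and the script is only an illustration,
not a Sun tool.

```python
import re
from collections import Counter

# Matches both message styles quoted above:
#   "... from Memory Module 1904"
#   "... Memory Module 1804 id 4."
MODULE_RE = re.compile(r"Memory Module\s+(\d+)")


def count_module_reports(lines):
    """Return a Counter mapping module label -> number of lines naming it."""
    counts = Counter()
    for line in lines:
        match = MODULE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def report(path):
    """Print per-module counts for a log file (path supplied by the caller)."""
    with open(path, errors="replace") as handle:
        for module, hits in count_module_reports(handle).most_common():
            print(f"{module}: {hits}")
```

Calling report("/var/adm/messages") before and after each DIMM move would
show whether the count follows the DIMM, the slot, or neither.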

I got onto Sun support, who told me it looked like a motherboard error. We
changed the motherboard, and again SunVTS passed all tests.

To my disgust, the errors are back again. I have run Sun Explorer on the host
a number of times, which Sun have analysed, but they have found no problems.
Their suggestion now is to break the memory interleave and disable a bank at
a time to try to isolate the problem. I can't do this, however, as it is a
production host and all 4 GB of memory is needed.

I have searched extensively in SunSolve etc. for clues, but to no avail. I
did notice, however, that some people have had problems with E450s
incorrectly diagnosing a failed DIMM. prtdiag does not show any errors at all.

Has anyone come across a problem like this before, and if so what was the
cause?

E450 spec -- 4 x 480 MHz processors, 4 GB memory (interleaved)
Solaris 8 Patch 108528-12
SunVTS version 4.6

I will of course summarize no matter what the outcome.
Thanks
-Padraig
