General System Reliability

2007-12-25 11:47:00

It has been over three weeks since I posted the original message, which
was something to the effect of "Is there a command (or commands) I can run
that will show me what is wrong with a system and how reliable it is?"
I asked because we were having problems on two E3000 boxes. One box had
nothing wrong with it; its problems were definitely user error. The other
3000, however, had SCSI errors in its /var/adm/messages file, ranging from
"timeouts" to "reset" timeouts to "resyncing" messages, all the way to
"random position errors" on the hard drive.

Sun Service recommended we make sure the external SCSI port on I/O Board 1
was terminated, so we terminated (in the SCSI sense) all of our 3000s; none
of them had been terminated on that board. After more drive replacements
and more SCSI errors, Sun finally figured out what was wrong: we were using
the external SCSI port on I/O Board 1 with an external Sun DDS 4mm tape
drive. The 3000 that was reporting all of the hardware errors had every
internal drive bay filled (ten 9 GB drives), which is more than likely why
that machine in particular was affected: possibly the heavier load on the
SCSI chain makes it less reliable when it is not terminated.

Now that we've terminated every I/O Board 1 SCSI port and are using I/O
Board 2's external SCSI ports instead, we don't seem to be getting any more
of these messages. It has only been 8-10 hours so far, but that's a good
track record already.
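
For anyone who wants to keep an eye on the same symptoms, here is a minimal
Python sketch that counts lines in a messages file mentioning the kinds of
errors described above. It is only an illustration: the keyword list is my
own guess based on those symptoms, not the exact wording of any driver
message, and the function names are hypothetical. The only thing taken
from the system itself is the /var/adm/messages path.

```python
import re
from collections import Counter

# Assumed keyword patterns, loosely based on the symptoms described above.
# Real driver messages may be worded differently, so adjust these to match
# what actually appears in your own log.
SYMPTOMS = {
    "timeout": re.compile(r"timeout", re.IGNORECASE),
    "reset": re.compile(r"reset", re.IGNORECASE),
    "resync": re.compile(r"resync", re.IGNORECASE),
    "position error": re.compile(r"position error", re.IGNORECASE),
}


def count_scsi_symptoms(lines):
    """Return a Counter of how many lines mention each symptom keyword."""
    counts = Counter()
    for line in lines:
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log(path="/var/adm/messages"):
    """Count symptom keywords in the log file at the given path."""
    # errors="replace" keeps stray non-text bytes from aborting the scan.
    with open(path, errors="replace") as fh:
        return count_scsi_symptoms(fh)
```

Calling scan_log() before and after a cabling or termination change gives a
rough way to see whether the counts have stopped growing. A line that
matches more than one keyword (a "reset" timeout, for example) is counted
once under each.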

Many thanks to the people who suggested checking the SCSI connections,
making sure the bus was terminated, checking the /var/adm/messages files,
using SyMON, making sure the cables weren't too long, and making sure that
no pins were bent or missing on the cables.

Thanks so much for everyone's help!

Damon
