in-memory file corruption

2007-12-25 7:23:00

Sorry that I haven't posted the summary earlier, but I wanted to wait to

ensure (to the best of my ability) that the problem had indeed been solved.

------------------------------------------------------------------------

PROBLEM: Apparent corruption of in-memory copy of files.

Configuration: 4/490 with 4 1-GB IPI disks

                64 MB memory

                4.1.1 generic kernel, NO patches applied

                Using tmpfs for /tmp

SYMPTOMS:

At least twice that I have detected, the in-memory buffered copy of a

file has become corrupted. Both times it was detected because an

executable aborted.

1. /usr/bin/csh gave segmentation fault. Although the dtm of the file

indicated that it had not been modified, I restored a copy from backup. A

cmp of the 2 files gave:

        cmp -l /usr/bin/csh ./csh

        114662 360 0

        114670 360 0

        114678 20 0

        114686 360 0

I moved the "bad" version in /usr/bin to another name and replaced it with

the new "good" version. Several hours later when I compared the files, they

were identical. Some of the diskless SLCs being served from the 4/490

server exhibited the same behavior with csh, others did not. Note that the

bytes that differ were 8 bytes apart.

2. /usr/lang/SC0.0/as died with an illegal instruction. Again, dtm of the

file indicated no modification. I restored a copy from backup, and compared

the files. Again, several bytes were different (with 8 byte offsets):

        cmp -l ./as /usr/lang/SC0.0/as

          8166 46 366

          8174 7 367

          8182 200 360

This time I tried clearing the file system buffer cache by tarring a large

(40 MB) file to /dev/null. After the tar, the files compared as identical.
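
The 8-byte spacing can be checked mechanically from the cmp output. The
following is only a minimal sketch in Python, not something that was run on
the 4/490: the function names parse_cmp_l and strides are made up for
illustration, and it assumes the usual "cmp -l" layout of a 1-based decimal
offset followed by the two differing byte values in octal.

```python
def parse_cmp_l(text):
    """Parse `cmp -l` output into (offset, byte_a, byte_b) tuples.

    Assumes each data line has three fields: a 1-based decimal offset
    and the two differing byte values in octal. Other lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        rows.append((int(fields[0]), int(fields[1], 8), int(fields[2], 8)))
    return rows


def strides(rows):
    """Return the distance between each pair of consecutive differing offsets."""
    offsets = [row[0] for row in rows]
    return [b - a for a, b in zip(offsets, offsets[1:])]


# The two cmp -l listings quoted in the SYMPTOMS section above.
CSH_DIFF = """\
114662 360 0
114670 360 0
114678 20 0
114686 360 0
"""

AS_DIFF = """\
8166 46 366
8174 7 367
8182 200 360
"""

for name, text in (("csh", CSH_DIFF), ("as", AS_DIFF)):
    print(name, strides(parse_cmp_l(text)))
```

For the two listings above this prints [8, 8, 8] for csh and [8, 8] for as.
Note that the file order on the two cmp command lines differs, so the octal
columns are not in the same good/bad order in both listings.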

------------------------------------------------------------------------

SOLUTION:

For once it appears to actually have been hardware. Here is my story:

1. I had run sundiag kmem and vmem tests as well as the limited CPU test

(FPU?). No problems were detected.

2. Sun strongly suggested applying the NFS jumbo patch (100173-03), since

it is reputed to fix some UFS as well as NFS problems. So, I installed:

100173-03: Date: 01/April/91 NFS Jumbo Patch

100174-01: Date: 03-Dec-90 SunOS 4.1.1: fixes for tmpfs bugs.

100259-01: Date: 02/Apr/91 SunOS 4.1.1: ufs_inactive patch

These did NOT solve the problem.

3. I had Sun come out with a new CPU and 2 new memory boards, with the

expectation of swapping them out. Sun software support had no other

suggestions. When we booted the CPU in diag mode to run the extended PROM

diagnostics, the system continually looped, printing the following:

        Boot PROM Selftest.

        EPROM Checksum Test.

        Context Register Test.

        Region Map Write-Write-Read-Read Test.

but never reached:

        Region Map Address Test.

We replaced the CPU board, and executed all of the PROM diagnostics

successfully. The problem has not occurred in the last several weeks, so I

feel fairly confident that we have solved the problem.

------------------------------------------------------------------------

Suggestions were:

        Sun software bug (likely suspect)

        3rd-party software with privileges.

        Malicious root user

        bad disk controller (possible suspect - problems only on READS)

        bad CPU board (the REAL problem)

        bad memory boards (likely suspect)

        bad ethernet port

        bad ethernet transceiver

                (probably not either of the above, since the server had

                problems as well as the client).

        tmpfs and NFS/UFS problems

        soft ECC errors

        SCSI cabling problems on SCSI disks

                (not our problem -- we only have IPI disks)

        corrupted shared libraries

                (a distinct possibility -- I had an SS1 that failed FPU tests

                when shared libc got corrupted.)

Thanks to:

From: "Ric Anderson" <ric@cs.arizona.edu>

From: curt@ecn.purdue.edu (Curt Freeland)

From: feldt@phyast.nhn.uoknor.edu (Andy Feldt)

From: bjk@pecos.rc.arizona.edu (Brian J. Kennedy)

From: John Posey <posey@utdallas.edu>

From: bparent@calvin.UCSD.EDU (Brian Parent)

From: edm@MDI.COM (Ed Morin)

From: bob@omni.com (Bob Weissman)

From: tessi!joey@nosun.West.Sun.COM (Joe Pruett)

From: mp@allegra.att.com (Mark Plotnick)

From: Gerald Justice <justice@dao.nrc.ca>

From: David Stewart <das@edee.edinburgh.ac.uk>

From: sundev!ronin!kevin@Sun.COM (Kevin Sheehan {Consulting Poster Child})

----------------------------------------------------------------

Doug Neuhauser Seismographic Station

doug@perry.berkeley.edu ESB 475, UC Berkeley

Phone: 415-642-0931 Berkeley, CA 94720
