Host Lockup During Socket Transfers

2007-12-25 7:23:00

My original question:

> In attempting to run TCP socket transfers between two processes
> residing in the same SPARCstation host, I am experiencing system
> lock-up after some number of transfers have taken place (the number
> of transfers varies; it's usually between 950 and 2850 with 30 ms
> intermessage spacing). Message sizes are 4496 octets. System
> configuration is:
>
>     SPARCstation 1+
>     SunOS 4.1
>     GENERIC Kernel.
>     16 MB memory.
>     210 MB disk.
>     56 MB swap.
>     Send/Receive Queues = 4096

The trophies go to the following, who discerned that I was in need
of the Loopback patch (100159-01):

Hal Stern stern@sunne.East.Sun.COM
Mark Plotnick mp@allegra.att.com
Daniel Quinlan danq%chs@boulder.colorado.edu

I was surprised to find that the loopback driver gets invoked even
without using the loopback address (thanks again to Hal Stern):

    "when the IP layer sees dest IP == my IP, it gives the packet
    to the lo device driver instead of the ie or le driver."
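
For anyone wanting to exercise the same path, here is a minimal sketch of a same-host TCP transfer. It is not the original test program: it uses two threads rather than two processes, and the function name run_transfer and its parameters are made up for illustration. Passing the machine's own (non-loopback) IP address as host should, per Hal's note, still go through the lo driver.

```python
import socket
import threading
import time

MESSAGE_SIZE = 4496    # octets per message, as in the original report
MESSAGE_COUNT = 50     # kept small here; the lock-up took 950+ transfers


def run_transfer(host="127.0.0.1", count=MESSAGE_COUNT,
                 size=MESSAGE_SIZE, spacing=0.03):
    """Send `count` messages of `size` octets over TCP within one host.

    Returns the total number of octets the receiving side read.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))           # port 0: let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    received = []

    def receiver():
        conn, _ = listener.accept()
        total = 0
        with conn:
            while True:
                chunk = conn.recv(65536)
                if not chunk:          # empty read: the sender closed the stream
                    break
                total += len(chunk)
        received.append(total)

    thread = threading.Thread(target=receiver)
    thread.start()

    payload = b"x" * size
    with socket.create_connection((host, port)) as sender:
        for _ in range(count):
            sender.sendall(payload)
            time.sleep(spacing)        # intermessage spacing (30 ms by default)

    thread.join()
    listener.close()
    return received[0]


if __name__ == "__main__":
    print(run_transfer(), "octets received")
```

On a healthy system the octet count returned equals count * size; on the unpatched configuration described above one would expect the run to stall instead.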

Other Responses:
---------------

From: Lixia Zhang lixia@parc.xerox.com

Don't know if you've already got help from others. To me, the system
resource you ran out of seems to be ports. I believe you can only
have a limited number of active TCP ports. When transmission is finished,
the closing connection must wait for a period of 2*T before the port is
freed, where T is the maximum lifetime of a packet in the net (this wait
period is for reliability, to make sure all previous packets have died
before the port can be reused). You may check the appendix of RFC1185 to
find out how long this T value is (if my memory is not faulty, I remember
it is mentioned there).
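
To put rough numbers on Lixia's reasoning, the arithmetic can be sketched as below. The inputs are assumptions, not measurements: the original report does not say that a new connection was opened per message, and T = 120 seconds is only the value RFC 793 suggests for the maximum segment lifetime. The function name is hypothetical.

```python
def ports_held_in_wait(connections_per_second, t_seconds):
    """Approximate steady-state count of ports still waiting out 2*T.

    Each closed connection holds its port for 2*T seconds, so at a
    constant close rate the number held is rate * 2 * T.
    """
    return connections_per_second * 2 * t_seconds


# Assumed scenario: one short-lived connection per message at 30 ms
# spacing is roughly 33 connections per second; T assumed to be 120 s.
print(ports_held_in_wait(33, 120))   # 7920
```

Under those assumptions several thousand ports would be tied up at once, which is the kind of exhaustion Lixia describes. With a single long-lived connection the figure does not apply.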

From: Mike Raffety mcnc!oddjob.uchicago.edu!oconnor!miker

I don't suppose your transmitting host has been up a long time (e.g.,
100-130 days), has it? I discovered a bug a year-ish ago in Sun's TCP
code. It's rather complex, but let me try to explain it ...

When you open a TCP connection, a byte counter is assigned to it, which
is simply a copy of a system counter that ROLLS OVER at 2^32, or about
18 weeks. Once the stream is opened, and the counter for that stream
initialized, that counter is incremented by one for each byte/octet
transmitted. If your machine is up long enough, and you transmit
enough data, eventually that stream-specific counter rolls over (the
closer to that magic 18 +/- weeks, the less data it takes to get
there). The RECEIVING TCP side DOESN'T roll over properly, so it fails
to recognize the packet after the rollover occurs, and asks for a
retransmit of the "right" packets. With backoff algorithms, this
quickly settles down to near-silence. Once the SYSTEM counter rolls
over, everything works fine again ... until the next rollover
approaches.
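
The kind of rollover problem Mike describes can be illustrated with a small sketch of 32-bit sequence comparison. This is not Sun's code and makes no claim about where the actual bug was; naive_lt and seq_lt are hypothetical names showing how a plain comparison differs from a wraparound-aware one.

```python
MOD = 2 ** 32       # the counter rolls over at 2^32
HALF = 2 ** 31


def naive_lt(a, b):
    """Plain comparison: gives the wrong answer across a rollover."""
    return a < b


def seq_lt(a, b):
    """True if a precedes b in 32-bit sequence space.

    Equivalent to testing whether (a - b), read as a signed 32-bit
    integer, is negative.
    """
    return (a - b) % MOD >= HALF


before = 0xFFFFFFF0                 # just before the rollover
after = (before + 0x20) % MOD       # 32 octets later, past the rollover

print(naive_lt(before, after))      # False: the newer data looks "old"
print(seq_lt(before, after))        # True: ordering survives the rollover
```

A receiver comparing in the naive way would treat the data after the rollover as out of sequence and keep asking for the "right" packets, which matches the symptom described.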

Many thanks to all who responded and to Mike Fischbein at Sun's Albany
office for mailing me the patch. Works like a champ!

John Thier
GE Defense Systems Division
Pittsfield MA
thier@orcad2.dnet.ge.com
