For a change, a solution rather than a problem:
The following bug report with analysis and fix was sent to both Sun
and Solbourne. This should clear up at least *some* of the hung
pty problems in the Sun bug list. One of my colleagues is pretty sure
that this may solve the problem of mysterious rdump hangs that he and
others on this list have been seeing.
Those of you with source can apply the patch and rebuild your kernel.
Get the driver source from /usr/src/sys/os, apply the patch and place
the result in /usr/sys/os. When I hear from Sun I will find out if I
can redistribute patched binaries. If anybody on the list knows the
legalities *FOR SURE* (a Sun employee maybe?) feel free to let me
know.
Cheers,
Andy Sherman/AT&T Bell Laboratories/Murray Hill, NJ
AUDIBLE: (201) 582-5928
READABLE: andys@ulysses.att.com or att!ulysses!andys
What? Me speak for AT&T? You must be joking!
------- Forwarded Message
To: hotline@sun.com
Subject: pty driver bug - WITH FIX!
Date: Thu, 07 Jun 90 19:48:50 EDT
This report applies to the pty driver in 4.0.3. It can be observed
on both sun3 and sun4 architectures. I believe that the bug may still
exist in 4.1, but I haven't upgraded yet.
This may be the same bug described in Bug Reference Number 1014706.
The description in the synopsis indicated that nobody had figured out
how to reproduce the bug. The bug I report here may be reproduced at
will with the attached code.
PROBLEM:
^^^^^^^
The intermittent behavior I observed was that a pty with background
processes still attached to it would sometimes become unavailable to
new opens by in.rlogind, in.telnetd, or a window system. What the
user would observe is that all attempts to use rlogin, for example,
result in an immediate "Connection closed" message. I attempted to
add an additional vhangup call in in.rlogind just after it forks (as
is done in the 4.3-tahoe code) but that did not unstick the pty. The
death (natural or otherwise) of the background process attached to the
pty will always unstick it. What was puzzling about the problem was
that a background process (such as Ingres) could run happily for
*DAYS* on a pty with lots of login/logout activity and then suddenly
become stuck for no apparent reason.
I have discovered that the pty becomes stuck when it has a background
process attached to it, *AND* a user exits a shell attached to it with
"stty 0". This will do it every time, as you can verify with the
following program.
/* hangit.c */
#include <stdio.h>

/* Fork a child that sleeps for an hour.  This is to be
 * used to test weird behavior of pty's when background
 * processes have open files, the "brain-dead pty" bug
 */
main()
{
    int pid;

    if ( 0 == ( pid = fork() ) ) {
        fprintf(stderr, "Going to sleep for an hour\n");
        sleep( (unsigned) 3600 );
        exit(0);
    }
    else if ( -1 == pid ) {
        perror("fork");
        exit(-1);
    }
    else exit(0);
}
If you compile the program into hangit, you can then recreate the
problem by doing the following from any shell in an rlogin or telnet
session.
$ nohup hangit & # nohup only required for sh, not ksh and csh
$ stty 0
WHAT THE BUG IS
^^^^^^^^^^^^^^^
The driver error is contained in the module /usr/sys/os/tty_pty.c.
When the TCSETAW ioctl (issued by /bin/stty) sets the speed to zero,
the pty flag PF_SLAVEGONE is set. This causes all further I/O to
return with an error, hence in.rlogind closes both master and slave.
If there is no background process, the driver close routines are
called for both master and slave. This releases all of the slave side
streams queues associated with the pty. When the slave is opened
again, the lack of queues is the signal to reset parameters in the pty
structure, which resets PF_SLAVEGONE and restores a non-zero baud
rate. If background processes are attached to the slave, none of the
shell or rlogind closes bring its open count to 0, so the driver close
for the slave is never called, and queues are still attached to the
pty structure. When the slave is reattached by a subsequent rlogin,
the baud rate is still set to zero and PF_SLAVEGONE is still up, since
the driver open code is not called. All subsequent I/Os return with
errors and rlogind exits. Also any slave-side ioctl will see the baud
rate as zero and make *sure* that PF_SLAVEGONE stays asserted.
THE FIX
^^^^^^^
This problem is fixed by a small change to the master side open
routine. There are two ways to solve the problem. I have elected to
mark the master as busy if the slave side is still busy, as indicated
by streams queues being attached to the pty structure. One could
emulate the BSD driver by just taking the opportunity to reset
PF_SLAVEGONE and restore the baud rate, but I believe that presents a
security hole, since the background processes, despite a vhangup, can
still open /dev/tty and write or read it, to the detriment of the new
session. The following diff applied to tty_pty.c will fix the
problem.
*** tty_pty.c.orig Thu Jun 7 19:09:47 1990
--- tty_pty.c Thu Jun 7 19:12:54 1990
***************
*** 618,631 ****
/* XXX - should be EBUSY! */
if (pty->pt_flags & PF_WOPEN)
wakeup((caddr_t)&pty->pt_flags);
! if (((q = pty->pt_ttycommon.t_readq) != NULL) &&
! ((q = q->q_next) != NULL)) {
/*
! * Send an un-hangup to the slave, since "carrier" is
! * coming back up.
*/
! (void) putctl(q, M_UNHANGUP);
! (void) putctl1(q, M_CTL, MC_DOCANON);
}
pty->pt_flags |= PF_CARR_ON;
pty->pt_send = 0;
--- 618,629 ----
/* XXX - should be EBUSY! */
if (pty->pt_flags & PF_WOPEN)
wakeup((caddr_t)&pty->pt_flags);
! else if (((q = pty->pt_ttycommon.t_readq) != NULL)) {
/*
! * Busy controller because slave still open somewhere
! * This avoids security hole in vhangup & /dev/tty.
*/
! return(EIO);
}
pty->pt_flags |= PF_CARR_ON;
pty->pt_send = 0;
------- End of Forwarded Message