SUMMARY: Re: Kill runaway processes - need help writing cron job!

From: Sanjay Gowda (sgowda@idt.com)
Date: Wed Jun 26 1996 - 16:47:44 CDT


Hello!

I would like to begin this summary by saying a heartfelt
"THANK YOU" to everybody for taking some time to
reply to my request. I have attached my original request at
the end. Basically, I wanted to write a script that will take
some action (like kill or send mail, etc), whenever it
encounters what it thinks are some runaway processes
which are hogging a lot of CPU time.

I resolved our issue by taking bits and pieces of several scripts
and creating a cron job that checks for runaway processes and emails
a list of them to me.

And I appreciate everybody for advising me NOT to kill programs,
but instead use some other method of notifying the users of their
processes hogging up a lot of CPU. I will be more careful
and will certainly be more proactive about such matters.

Once again thank you. Here are some of the relevant
replies/suggestions that I received.
  

Sanjay

PS: All the replies attached begin with -------- REPLY --------.
     So if you want to cut and paste a script, you should
     be able to distinguish between different scripts/replies.
     I did receive a number of ME TOOs. That is why I am
     including all the replies that I received.

-----------------------------------------------------------------------
Sanjay K Gowda sgowda@idt.com
System Administrator 503-681-6382
Integrated Device Technology Hillsboro, Oregon
-----------------------------------------------------------------------

-------- REPLY --------
From: Peter Tashkoff <TASHKOP@kiwi.co.nz>

Sanjay
I strongly recommend that you don't simply kill
these pids as a lot of processes can end up going
over whatever benchmark you may set.

This script will identify problem pids
ps -fp `ps -ec | grep -v defunct | grep -v PID | grep -v oninit |
  sed -e 's/:/\./' | awk ' { if ($5 > 200) printf("%s,", $1); } '`

If you want to just kill them then change ps -fp to
'kill -whatever'
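
Note for anyone adapting the pipeline above: the sed step turns the first
colon of TIME into a decimal point so that awk can compare it with 200. A
more explicit variant converts the whole TIME field to seconds. This is only
a sketch; the name cpu_seconds is made up for illustration, and it assumes
TIME looks like mm:ss, hh:mm:ss or dd-hh:mm:ss, which varies between ps
implementations.

```shell
#!/bin/sh
# cpu_seconds: hypothetical helper, not from the original post.
# Converts a ps TIME value (mm:ss, hh:mm:ss or dd-hh:mm:ss) to seconds.
cpu_seconds() {
    echo "$1" | awk '{
        days = 0
        t = $1
        # Optional "dd-" prefix
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, f, ":")
        secs = 0
        # Fold the colon-separated fields left to right, base 60
        for (i = 1; i <= n; i++) secs = secs * 60 + f[i]
        print days * 86400 + secs
    }'
}
```

With that in place, a threshold of 200 minutes becomes a plain integer test,
for example: [ "$(cpu_seconds 220:29)" -gt 12000 ] && echo "over 200 minutes"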

I strongly suggest that rather than do this you
address the base problem: why are they not dying?
Under Sol 2.4 the /bin/sh program is buggy and
exhibits the behaviour that you have described.
If you are running sol 2.4 and using the bourne shell
for these processes; change to the korn shell and the
problem will go away.
HTH
Regards

--
Peter Tashkoff
NZ Kiwifruit                 Project Consultant/DBA
Marketing Board.         Standard Disclaimers apply
This posting may not be used by any commercial
organisation to vilify another.

-------- REPLY --------
From: "Rasana P. Atreya" <Rasana.Atreya@library.ucsf.edu>

Check out http://www.eecs.nwu.edu/unix.html

They have a section for shell scripts FAQs.

Rasana

-------- REPLY --------
From: bern@penthesilea.uni-trier.de (Jochen Bern)

Better have the runaway Processes annihilate themselves before you go out and send around SIGKILL as root from a cron Job. All you need to do is to limit the CPU Time a single Incantation of the Program is allowed to use.

If you do not have the Source Code, calling the following csh Script instead of <program> should do the Trick:

#!/bin/csh
#
limit -h cputime 10000    # or whatever Amount of CPU Seconds you want
<program> $*
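
For sites that prefer the Bourne shell, roughly the same wrapper can be
sketched with ulimit. Assumptions: the shell's ulimit builtin accepts -t
(CPU seconds), which most sh implementations do although POSIX does not
require it, and run_limited is a made-up name for illustration.

```shell
#!/bin/sh
# run_limited: hypothetical wrapper, an sh counterpart of the csh script.
# Usage: run_limited CPU_SECONDS command [args...]
run_limited() (
    secs=$1
    shift
    # The limit applies only to this subshell and whatever it starts;
    # the calling shell keeps its own limits.
    ulimit -t "$secs" || exit 1
    # Replace the subshell with the real program (external commands only)
    exec "$@"
)
```

A wrapper script would then contain a single line such as
run_limited 10000 <program> "$@"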

Regards, J. Bern

-------- REPLY --------
From: Stephen P Richardson <spr@myxa.com>

Included below is a script that sends email when processes that are consuming over 20% CPU are detected. It does not actually execute the kill command, it just sends it in an email for consideration ;-)

The script needs to have the email recipient filled in, and a regular expression needs to be added or completely cleared. The version of this in production is only considering a particular group of processes for possible eradication.

Note, you asked about a script to remove processes based on accumulated time, but for our purposes, large CPU consumption was the problem. Hope this helps.

-- Regards, Stephen

---------------------------------------------------------------------
Myxa Corporation                  Voice: (610) 436-0380
334 West Union Street             FAX:   (610) 429-9207
West Chester, PA 19382-3329       Email: spr@myxa.com
           ** Welcome Page: http://www.myxa.com **

---------------------------------------------------------------------
#!/bin/sh
#
# Wed May  1 09:25:07 EDT 1996
#
# script to check runaway processes every hour.
# a runaway process is defined as one that uses over 20% cpu resources
#
# runaways will be reported via email with the kill command listed for
# cut-n-paste execution if desired.
#
# to get cpu % by process run /usr/ucb/ps -aux
# %cpu is third column
#
# problem: columns run together when numbers are large.  Try using character
#          spaces to define the field needed.
#
# WARNING: overflow may shift all character positions
#
#
PATH=.:/usr/local/bin:/usr/local/etc:/usr/bin:/bin:/usr/ucb:/usr/sbin:/sbin:/etc:/usr/ccs/bin:

# !NOTE!  This uses grep to select only processes of interest for this
#         particular program.  An alternate would be to use grep -v to
#         de-select processes that should not be examined.
# *** REGULAR EXPRESSION MUST BE FILLED IN OR DELETED HERE ***

/usr/ucb/ps -auxww | \
        grep "XXXXXXXXXXXX" | \
        grep -v grep | \
        cut -c10-14,15-20,61- > /tmp/tmp-processes

export LINE             # LINE is assigned to each input line
                        # from the temporary file
(
read LINE
while [ -n "$LINE" ]
do
        set $LINE
        if [ $2 -gt 20 ]; then
mailx -s '***Runaway Processes***' RECIPIENT <<EOF
$LINE

kill $1

EOF
        fi
#       echo "$LINE"
        read LINE
done
)< /tmp/tmp-processes
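
One caveat when adapting the loop above: the -gt operator of test compares
integers only, while ps prints %CPU with a decimal point (25.3, say), so the
comparison can fail with an error instead of matching. A minimal sketch of
doing the threshold check in awk instead follows. The name over_cpu and the
input layout (PID first, %CPU second) are assumptions for illustration, not
part of Stephen's script.

```shell
#!/bin/sh
# over_cpu: hypothetical filter. Reads "PID %CPU [command...]" lines on
# stdin and prints those whose %CPU exceeds the threshold given as $1.
over_cpu() {
    # Adding 0 forces a numeric comparison, so 25.3 > 20 works as expected
    awk -v limit="$1" '$2 + 0 > limit + 0 { print }'
}
```

On a system whose ps supports the POSIX -o option, it could be fed with
something like: ps -e -o pid= -o pcpu= -o comm= | over_cpu 20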

-------- REPLY --------
From: keith@oz.health.state.mn.us (Keith Willenson)

Included below is a script I run every night to clear off orphaned jobs left by our NCD X terminal

HTH,

K

++++++++++++++++++++

X-Sun-Data-Type: shell-script
X-Sun-Data-Description: shell-script
X-Sun-Data-Name: killer.script
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 24

#!/bin/sh
PATH=/usr/bin:/bin
export PATH
user='sample1'    #this could be a program name also
proclist=`ps -ef | grep $user | awk ' $0 !~ /grep/ {print $2}'`
rm -f /tmp/junk.$$
echo "cron script running on \c" >>/tmp/junk.$$
uname -a >>/tmp/junk.$$
echo "Following processes removed for $user" >>/tmp/junk.$$
date >>/tmp/junk.$$
echo "PASS 1 - cleanup" >>/tmp/junk.$$
for proc in $proclist;do
  kill $proc
  echo "kill $proc" >>/tmp/junk.$$
done
sleep 30   #wait thirty seconds to give stuff a chance to die
echo "PASS 2 - double check (should be blank below)" >>/tmp/junk.$$
proclist=`ps -ef | grep $user | awk ' $0 !~ /grep/ {print $2}'`
for proc in $proclist;do
  kill -9 $proc
  echo "kill -9 $proc" >>/tmp/junk.$$
done
mail admin_user </tmp/junk.$$    #admin_user is our alias for sysadmin or me
rm -f /tmp/junk.$$
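
The two-pass idea above (plain kill first, kill -9 only for survivors) can
also be applied per process, so that nothing waits longer than it has to.
The following is a sketch under assumptions: term_then_kill is a made-up
name, and the default grace period of 30 seconds simply mirrors the sleep
in Keith's script.

```shell
#!/bin/sh
# term_then_kill: hypothetical helper illustrating the two-pass idea.
# Usage: term_then_kill PID [GRACE_SECONDS]
term_then_kill() {
    pid=$1
    grace=${2:-30}
    # Pass 1: polite SIGTERM; nothing more to do if the signal cannot be sent
    kill "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        # kill -0 sends no signal, it only checks that the PID still exists
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done
    # Pass 2: still there after the grace period, escalate to SIGKILL
    kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null
    return 0
}
```

In the script above, the body of the first loop could call
term_then_kill $proc 30 and the second pass would no longer be needed.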

-------- REPLY --------
From: "ron d. parachoniak" <rap@physics.ubc.ca>

Well, for what it's worth, here's what I do. Not very elegant but it works.

#--------------
# kill-hogs
#--------------
#
# job to kill all jobs running more than 5mins and not niced to -19
#
# run this job from cron daily every 15mins
#
# ubc physics dept, by r.d.parachoniak 95-05-01

PATH=/bin:/usr/bin:/usr/ucb:/etc:/usr/etc:/usr/local/bin
export PATH

# set up temporary file

TMP1=/tmp/hogs.$$

# set up file used to hold pids of hogs

HOGS=/tmp/hogpids

# set up log file used to hold top output of killed jobs

LGFILE=/var/log/killedjobs

# set up traps for system exits

# exit with status of 1 when hangup, software interrupt,
# or software termination signal received

trap 'exit 1' 1 2 3 15

# execute this command whenever script exits due to exit command
# or due to reaching its end

trap 'rm $TMP1 2> /dev/null ' 0

# place output of top in /tmp/hogs.$$

/usr/local/bin/top -bqu all > $TMP1

# debug
# echo "------------------------------------------"
# echo `date`
# debug

# zap /tmp/hogpids file if it exists

if [ -f $HOGS ]
then
   rm -f $HOGS
fi

# run awk script to put pids of all jobs with
#  nice level<19, and
#  running for more than 5 mins, and
#  not owned by root, and
#  using more than 10% of one CPU
# into file /tmp/hogpids

awk '$4 < 19 && $2 != 0 {print}' $TMP1 | tr ':%' '. ' | awk '$8 > 5 && $10 > 10 {print $1}' - > $HOGS

# debug
# echo "PID NICE TIME UID CPU"
# awk '$4 < 19 && $2 != 0 {print}' $TMP1 | tr ':%' '. ' | awk '$8 > 5 && $10 > 10 {print $1, $2, $4, $8, $10}' -
# echo "-------------------------------------------"
# echo "cat HOGS..."
# cat $HOGS
# debug

# write a record to logfile if there are jobs to kill

if [ `cat $HOGS | wc -l` != "0" ]
then
   echo `date '+%y/%m/%d - %H:%M'` `awk '$4 < 19 && $2 != 0 {print}' $TMP1 | tr ':%' '. ' | awk '$8 > 5 && $10 > 10 {print}' - ` >> $LGFILE
fi

# now kill all the jobs in $HOGS

for hogpid in `cat $HOGS`
do
#   echo "kill -9 $hogpid"
   kill -9 $hogpid
done

if [ -f $HOGS ]
then
   rm -f $HOGS
fi

if [ -f $TMP1 ]
then
   rm -f $TMP1
fi

# debug
# echo "------------------------------------------"
# debug

exit 0
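
The trap-based cleanup used in kill-hogs can be packaged into a small
helper. The sketch below is an illustration only: with_scratch is a made-up
name, and it assumes a mktemp command is installed (common, but not part of
POSIX). An unpredictable file name is generally safer than a fixed one such
as /tmp/hogpids when the job runs as root.

```shell
#!/bin/sh
# with_scratch: hypothetical helper, not part of the kill-hogs script.
# Runs "command [args...] SCRATCHFILE" and removes the file afterwards.
with_scratch() (
    # mktemp is assumed to be available; it creates the file for us
    scratch=$(mktemp "${TMPDIR:-/tmp}/hogs.XXXXXX") || exit 1
    trap 'rm -f "$scratch"' 0        # clean up on any exit
    trap 'exit 1' 1 2 3 15           # turn signals into a normal exit
    "$@" "$scratch"
    exit $?                          # pass the command's status through
)
```

The command receives the scratch file name as its last argument, so a
function that writes the top output to "$1" and then processes it could be
run as: with_scratch that_function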

-------- REPLY --------
From: Rich Kulawiec <rsk@advsys.com>

I think I'd skip the entire programming exercise and get my hands on "skill", which is made to do just this sort of thing. Here's an article announcing its availability:

> From: earle@poseur.jpl.nasa.gov (Greg Earle - Sun JPL Software Support)
> Newsgroups: comp.archives
> Subject: [sun-spots] Re: 4.1 patches for Kernel sensitive programs (sps,fstat,ofi
> Date: 21 May 90 00:55:53 GMT
> X-Original-Newsgroups: comp.sys.sun
>
> Archive-name: skill/19-May-90
> Original-posting-by: earle@poseur.jpl.nasa.gov (Greg Earle - Sun JPL Software Support)
> Original-subject: Re: 4.1 patches for Kernel sensitive programs (sps,fstat,ofi
> Archive-site: snake.utah.edu [128.110.4.58]
> Archive-directory: pub
> Archive-files: skill_2.6_shar
> Reposted-by: emv@math.lsa.umich.edu (Edward Vielmetti)
>
> In Sun-Spots you write:
> >X-Sun-Spots-Digest: Volume 9, Issue 165, message 10
> >
> >Has anyone worked out 4.0 -> 4.1 patches for the usual kernel sensitive
> >programs like sps, fstat, and ofiles?
>
> None of those three, but version 2.6 of `skill'/`snice' (Jeff Forys'
> program to kill or renice processes based on user name, process name, or
> tty name or any combination thereof) is now available with 4.1 support.
> Use anonymous FTP to snake.Utah.EDU and retrieve pub/skill_2.6_shar.
>
> I have not tried to tackle the others, such as `top', `sps', `ofiles' et al.

Cheers, Rich

-------- REPLY --------
From: Bob Devonshire (SSGSD OL-B/SDTO) <devon@ssgsd-www.tinkernet.af.mil>

Sanjay, I use the following script to kill unwanted processes. Of course, in my case, I know what process I'm looking for in advance. You might be able to adapt the script for your specific use, maybe search for a process with a certain amount of time or a specific user, etc. I run the script from my "root" cron as follows:

0 23 * * * /usr/lbin/stop-time1 > /dev/null 2>&1

These processes are left over from another script I use to log out idle users (over 15 minutes with no activity) and sometimes the processes don't go away. Let me know if I can be of further service,

-Bob-

///////////////////////////////

#
#       STOP-TIME1
#       Bob Devonshire - 11/12/93
#
PIDS()
{
pids=`/bin/ps -ef | /usr/bin/egrep "time1" | /usr/bin/cut -c10-14`
}
TIME1()
{
echo "\n\tKilling TIME1 daemons. Please wait ... \n"
for time1 in `ps -ef | egrep time1 | egrep -v "egrep time1" | cut -c10-14 | sed -e 's/ //g'`
do
        kill -9 $time1 >/dev/null 2>/dev/null
done
}
#
#       Main Module
#
echo "\tStopping the Time1 daemon ... \c"
PIDS
if [ "$pids" ]
then
        TIME1
        echo "\t\tDone!!!\n"
else
        echo "\n\n\tThere were no TIME1 processes to stop ... \n"
fi
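
When adapting this, keep in mind that a pattern match over the whole ps -ef
line can also pick up the egrep process itself, or any other command whose
arguments happen to contain the pattern. Matching on the command name alone
avoids that. A minimal sketch follows; pids_named is a made-up name, and
the "PID COMMAND" input layout is an assumption that relies on a ps with
the POSIX -o option.

```shell
#!/bin/sh
# pids_named: hypothetical filter. Reads "PID COMMAND" lines on stdin
# and prints the PIDs whose command name is exactly $1.
pids_named() {
    awk -v name="$1" '$2 == name { print $1 }'
}
```

It could be fed with something like: ps -e -o pid= -o comm= | pids_named time1
Whether comm holds a bare name or a full path differs between systems, so
check the output of ps first.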

----------------- END OF REPLIES ----------------

HERE is my ORIGINAL REQUEST:
>
> Hello!
>
> I have an unusual request which I hope somebody will oblige.
>
> We have a server which is providing X-sessions to several X-terminals
> and PCs running Exceed, on Win3.x, Win95 & WinNT. We often have
> some processes become orphaned and just chug along taking up a lot
> of CPU. I have come to identify these processes as those which
> have a PPID (parent PID) as 1 and when I do a ps -ef, I also see
> that under TIME column, the time is well above 200:00. Only
> runaway processes take up this much CPU time (from what I have
> seen so far, in our environment!).
>
> For example, when I do a ps -ef | grep <program>, I see,
>
>      UID   PID  PPID  C    STIME TTY      TIME COMMAND
>   sgowda  2234     1  0   Dec 31 ?      220:29 <program>
>
> I need to write a script, a cron job that runs every hour, and kill
> all processes (excluding userids - root, daemon, etc), which
> exhibit the above characteristics. Can I please request some example
> scripts (using perl, sh, awk, etc) to do this? Do you know of a
> mailing-list for discussions on writing shell scripts, c programs,
> which I can call upon to help write scripts, etc, if this is not
> the proper place to discuss this.
>
> Of course, I will summarize.
>
> Thank you.
>
> Sanjay
>
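
The criteria in the request (PPID of 1, TIME well above 200:00, system
accounts excluded) can be written as a report-only filter, in line with the
advice in the replies not to kill automatically. This is a sketch and not
the script that was finally put into cron. The name orphan_hogs and the
five-column input (USER PID PPID TIME COMMAND) are assumptions; fixed
columns are used because the STIME field of ps -ef can contain a space
("Dec 31"), which shifts awk's field numbers.

```shell
#!/bin/sh
# orphan_hogs: hypothetical filter, not the script Sanjay ended up using.
# Reads "USER PID PPID TIME COMMAND" lines on stdin and prints those with
# PPID 1 and more than $1 minutes of CPU time, skipping root and daemon.
orphan_hogs() {
    awk -v limit="$1" '
        $1 == "root" || $1 == "daemon" { next }   # never report system users
        $3 != 1 { next }                          # only children of init
        {
            # TIME is assumed to be mm:ss or hh:mm:ss; convert to minutes
            n = split($4, f, ":")
            mins = (n == 3) ? f[1] * 60 + f[2] : f[1]
            if (mins + 0 > limit + 0) print
        }'
}
```

With a ps that supports the POSIX -o option, the hourly cron job could run
ps -e -o user= -o pid= -o ppid= -o time= -o comm= | orphan_hogs 200
and mail the output to the administrator when it is not empty.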



This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:11:03 CDT