Hi managers,
After I sent my first summary of my question, I received several
replies. I have checked my system against the suggestions made
by different people and also installed the at patch. The only thing
I didn't do was run truss, because when the problem occurred
it was either at night or nobody was around. As I mentioned in my original
message, at jobs had no problem generating another at job afterward
(from the same user ID and the same program, though the at job
always has a different name), so I couldn't use truss to check after
the problem had occurred.
Unfortunately, a few days after I had done all this, the same problem
happened again without leaving any trace for me to check. So I am
still trying to find ways to tackle this problem.
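
Since the failure leaves no trace and tends to happen when nobody is
around, one option is to wrap the submission so that it records its own
context when it fails. The following is a minimal sketch, not a tested
fix. It assumes a POSIX-style shell and that at either exits non-zero or
prints "can't create" on stderr when it fails; the function name
submit_logged, the variable AT_DEBUG_LOG and the default log path are all
made up for illustration.

```shell
# Run a submission command; if it fails, append diagnostics to a log.
# submit_logged, AT_DEBUG_LOG and the default paths are hypothetical names.
submit_logged() {
    log=${AT_DEBUG_LOG:-/var/tmp/at-failures.log}
    errfile=${TMPDIR:-/tmp}/at-err.$$

    # stdin is passed through untouched, so "at" can still read a job from it.
    "$@" 2>"$errfile"
    status=$?
    cat "$errfile" >&2          # keep the normal stderr output visible

    # Assumption: failure shows up as a non-zero status or as the
    # "can't create" text on stderr.
    if [ "$status" -ne 0 ] || grep "can't create" "$errfile" >/dev/null; then
        {
            echo "=== $(date): '$*' exited with status $status"
            echo "--- stderr:"
            cat "$errfile"
            echo "--- id: $(id)"
            echo "--- cwd: $(pwd)"
            echo "--- umask: $(umask)"
            echo "--- LOGNAME=${LOGNAME:-<unset>} USER=${USER:-<unset>} HOME=${HOME:-<unset>}"
        } >>"$log"
    fi
    rm -f "$errfile"
    return "$status"
}
```

In the batch script, the existing at call would then be prefixed with the
wrapper, for example "submit_logged at -f /path/to/next_job.sh now + 10 minutes"
(the script path and time are placeholders).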
The following are the suggestions from different people. I thank all
of them for helping me out. My original question and summary are at
the end of the message.
Yuan
--------------------------------
>From "Steve Kay" <steve@peachy.com>
run truss
--------------------------------
>From Colin_Melville@mastercard.com
"D. Stewart McLeod" <stewart.mcleod@boeing.com>
install patch 103690-06
------------------------------
>From "Rose, Robert" <Robert.Rose@ag.gov.au>
Have you thought of running truss on cron to see what it's doing at the
time? I realise that this would be difficult to do if the problem is
intermittent or difficult to reproduce, but it might give you a clue as to
where cron is denying the job.
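
For an intermittent problem, truss can be attached to the running cron
daemon and left writing to a file until the failure recurs. A small
sketch that only builds the command line: -p attaches to a process ID,
-f follows the children cron forks, and -o sends the trace to a file. The
function name and the default output path are assumptions, and the trace
file can grow large if left running for days.

```shell
# Print (not run) a truss command that attaches to an already running cron.
# truss_cron_cmd and the default output path are hypothetical.
truss_cron_cmd() {
    pid=$1
    out=${2:-/var/tmp/cron.truss}
    printf 'truss -f -o %s -p %s\n' "$out" "$pid"
}
```

Calling "truss_cron_cmd 214" prints "truss -f -o /var/tmp/cron.truss -p 214",
which would then be run as root with the real PID of cron in place of 214.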
Also, check umasks and permissions / locking on the files associated with
cron and at, particularly the pipe FIFO. Maybe you're having some kind of
race condition or something.
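
A quick way to check the FIFO is sketched below. The location of the FIFO
is deliberately left as an argument, because it is not given here and may
differ between releases; the function name is made up, and a POSIX-style
shell is assumed.

```shell
# Report whether the given path exists and is a named pipe, then show
# its owner and permissions. check_fifo is a hypothetical helper name.
check_fifo() {
    fifo=$1
    if [ ! -e "$fifo" ]; then
        echo "missing: $fifo"
        return 1
    fi
    if [ ! -p "$fifo" ]; then
        echo "not a named pipe: $fifo"
        return 1
    fi
    echo "named pipe: $fifo"
    ls -l "$fifo"
}
```

Running it from the batch user's own environment, together with a plain
"umask", shows what that user actually sees at submission time.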
Not sure if any of that will help, maybe it might give you something to work
with or get a few extra brain cells going!
------------------------------
>From Jim Harmon <jharmon@telecnnct.com>
You need to show 2 things:
What is the EXACT command that is trying to issue the
"at"?
Is there another at-job with the SAME NAME in the at-queue
when this command is issued?
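
The second question can be checked from a script by scanning the queue
listing for a given job name. This is a sketch under two assumptions: the
listing (for example from "at -l") is read from standard input, and the
job name is the first whitespace-separated field on each line. The
function name and the sample job name below are made up.

```shell
# Succeed if the job name given as $1 appears as the first field of any
# line read from stdin. queue_has_job is a hypothetical helper name.
queue_has_job() {
    wanted=$1
    while read -r id rest; do
        if [ "$id" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}
```

Usage would look like "at -l | queue_has_job 896463698.a && echo duplicate",
run immediately before the script issues its own at command.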
-------------------------------
>From eddy@slimepuppy.apple.com (Eddy Fafard )
We had at problems a while ago. From what I remember
there was a problem with .cshrc and openwin/cmdtool.
I think what we did was strip the .cshrc file to the essentials
and use xterms instead of cmdtools, and it worked.
-------------------------------
>From "O'Neal,Chris" <onealwc@AGEDWARDS.com>
I have some questions which might help you debug your "at" problem:
Do the jobs run correctly when they are entered manually on the command
line?
Has someone recently aliased "at" or changed an existing alias for "at"?
Try "unalias at; at $$$$$$$$$". What do you get?
Have these jobs been moved to another machine? If so, is that when the
problem started? If not, have you tried them on a different machine to
see what the results would be?
Does the /var/spool/cron/FIFO file exist? Is it corrupt?
Have all "cron" patches been installed? My understanding is that
Solaris 2.1 thru 2.4 have cron/batch/at problems which SUN has some
patches for.
Are any of the starting directories where "at" jobs are launched being
deleted before the "at" job runs? What are the permissions on those
directories?
Is one of the "at" jobs waiting on output from a previous "at" job?
That is, is it scheduled to run before the previous "at" job is ready
for it to go? If the next job tries to run but cannot do so because
the previous one hasn't finished, then it gets deferred, but it is not
the next job to get run.
Is one of the "at" jobs expecting a particular environment variable
which is no longer set in its csh or other startup/login cfg file?
Can the "at" job find all its commands in the PATH?
What is the "load" on the running machine at the time the "at" jobs run?
Is it over 2?
How many jobs do you have scheduled in your job queue via ALL crons,
batches, and ats?
What are your "executing job maximum" and "queue execution maximum" set
to?
Are you exceeding these?
Is someone using "atq" and "atrm" to remove multiple jobs per day?
Is "crontab -e" being used during these "at" job runs? This resets cron
each time for each user and can cause "batch" and "at" jobs to hang.
How many jobs (ALL) are scheduled to run at or around the same time as
these "at" jobs? When multiple at jobs are scheduled to occur at a
certain time, and those at jobs schedule other at jobs, the newly
scheduled at jobs may reuse at job numbers that already exist. This
behavior is inconsistent, but will happen more frequently on a busier
system.
Hope some of the above helps you debug your "at" problems.
Bye,
Chris O'Neal
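
Two of the questions above (whether the starting directory still exists,
and whether the job can find its commands in the PATH) can be checked
from inside the batch job just before it resubmits itself. A minimal
sketch, assuming a POSIX-style shell; the function name preflight and the
example arguments are made up.

```shell
# Check that the starting directory exists and is searchable, and that
# each named command can be found in PATH. Prints one line per problem
# and returns non-zero if anything is wrong. preflight is a hypothetical name.
preflight() {
    dir=$1
    shift
    rc=0
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        rc=1
    elif [ ! -x "$dir" ]; then
        echo "directory not searchable: $dir"
        rc=1
    fi
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not in PATH: $cmd"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, "preflight /data/batch at awk" (placeholder directory) would
report a deleted starting directory or a missing command before the at
call is attempted.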
----------------------------------
>From "Carsten B. Knudsen" <cbk@terma.dk>
This is really a long shot, but since nothing else has helped, you
could consider giving it a try (I haven't tried it out myself, since I
have no machines available for the purpose):
Have you any idea whether the problem correlates with who is logged
onto the machine at that time? It just might be that if at(1) is
called from an environment where $LOGNAME or $USER is not set, it makes
some silly assumptions about who the caller is - such as who is on the
console... I can see on my own system that at(1) is a set-uid,
root-owned program, so maybe it is not even able to get the right
answer from calling getuid().
Try setting $LOGNAME and $USER - and maybe even $HOME - manually inside
the script calling at(1) and see what happens.
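
A sketch of what that could look like near the top of the calling script.
It only fills in variables that are empty or unset and leaves existing
values alone. It assumes a POSIX-style shell and an id command that
accepts -un (on some older systems only the XPG4 version of id does); the
function name and the home directory in the usage line are placeholders.

```shell
# Make sure LOGNAME, USER and HOME are set before calling at(1).
# Existing values are kept. $1, if given, is used as HOME when HOME is
# empty or unset. ensure_identity is a hypothetical helper name.
ensure_identity() {
    if [ -z "${LOGNAME:-}" ]; then
        LOGNAME=$(id -un)       # assumption: this id supports -un
    fi
    if [ -z "${USER:-}" ]; then
        USER=$LOGNAME
    fi
    if [ -z "${HOME:-}" ] && [ -n "${1:-}" ]; then
        HOME=$1
    fi
    export LOGNAME USER HOME
}
```

The script would call, for instance, "ensure_identity /export/home/batch"
once before it runs at.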
Let me know what you find out.
----- Begin Included Message -----
>From lu Fri May 29 13:41:38 1998
To: sun-managers@ra.mcs.anl.gov
Subject: SUMMARY: at: can't create a job for you
I got three replies to my question, with a total of four suggestions:
1. look at at.allow and at.deny files in /usr/lib/cron directory
2. look if cron daemon is running
3. look at cron log file /var/cron/log to see any error message
4. check whether another user can submit a job.
Unfortunately, none of the answers helped me. There are no error
messages in the cron log file; there are no problems with the at.allow
and at.deny setup; the cron daemon has been running since the system
was last rebooted about a month ago; and another job (with the same
user ID) submitted another batch job 10 minutes later!
We have been running these batch jobs for a few years. We have rarely
seen this kind of message. When we do see it, only one or two
jobs are affected; the other batch jobs can still submit batch jobs
later.
So I am almost convinced there is some kind of bug in at or cron. Can
anyone give me more suggestions? This problem has haunted me for a
long time. Please help.
My thanks go to:
Dennis Martens <MARTENSD@health.qld.gov.au>
"Rodney C. Marable" <marable@xcom.net>
Stefan Voss <s.voss@terradata.de>
Yuan
>
> Hello,
>
> We have a SPARC10 with OS 2.5.1 running about 20 batch jobs. At any time
> only one batch job is running. Each job queues itself for its next run after
> it finishes processing data. Occasionally we see the error message
> "at: can't create a job for you", and the batch job fails to queue itself.
>
> Can anyone tell me what the problem is?
>
> P.S. I have raised the number of concurrent jobs to 20 in the
> /etc/cron.d/queuedefs file.
>
> Thanks,
>
> Yuan
>
----- End Included Message -----
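
For reference, the P.S. in the quoted question concerns
/etc/cron.d/queuedefs, where each line has the form q.[njobj][nicen][nwaitw]
and queue "a" is the at queue. The sketch below splits one such entry
into its fields so the current limits can be printed from a script. The
function name and the sample entry "a.20j1n" are illustrative; the actual
line on the affected machine is not shown in this thread.

```shell
# Split one queuedefs entry (e.g. "a.20j1n") into its fields.
# A field missing from the entry is reported as "default".
# parse_queuedef is a hypothetical helper name.
parse_queuedef() {
    entry=$1
    queue=${entry%%.*}
    njob=
    nice=
    nwait=
    # Put a space after each j/n/w suffix so the limits split into tokens.
    for tok in $(printf '%s\n' "${entry#*.}" | sed 's/\([jnw]\)/\1 /g'); do
        case $tok in
            *j) njob=${tok%j} ;;
            *n) nice=${tok%n} ;;
            *w) nwait=${tok%w} ;;
        esac
    done
    printf 'queue=%s njob=%s nice=%s nwait=%s\n' \
        "$queue" "${njob:-default}" "${nice:-default}" "${nwait:-default}"
}
```

For the sample entry, "parse_queuedef a.20j1n" prints
"queue=a njob=20 nice=1 nwait=default".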