Why is my job deferred or blocked?
When you submit a job to the HPC system, a Job Manager evaluates the job's requirements. The job either starts to run immediately or is placed in a queue until the resources it needs are available. Each job is assigned a state that describes what is happening to the submission. The following list describes what each of these states means.
- Running
- As the name implies, your job is running on the system. The number of jobs a user is allowed to run at any one time is determined by the user's group membership. For example, some users belong to owner groups whose members may run on as many processors as are available beyond the processors that the group owns.
- Idle (Eligible)
- Eligible jobs in the idle state are actively being evaluated by the scheduler and will start to run as soon as a suitable set of nodes is available. We've configured MOAB such that only 20 jobs per user can be in this state at any given time. Limiting the number of eligible idle jobs per user protects the scheduler from becoming overloaded by large job submissions by a single user.
- Deferred/Idle (Blocked)
- Jobs are blocked when a user has submitted more than 20 jobs and 20 of them are still idle but eligible to run. The hold on blocked jobs that are labeled idle is removed automatically after a short period of time. A deferred job waits 10 minutes before a re-match is tried. If no match is found, the job is deferred for 20 minutes before the next re-match. This repeats, and the defer time doubles each time the job is deferred. Users can force a re-match with the command releasehold -a.
- BatchHold
- If your job is in this state, something serious is wrong with the job submission and you will likely need to fix the problem and resubmit the job to get it to run. MOAB tags jobs with this state when the job cannot be run because the requested resources are not available in the system or because the resource manager has repeatedly failed in attempts to start the job. Again, when a job enters this state something is seriously wrong.
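
The 20-job eligibility limit and the doubling defer time described above can be illustrated with a short Python sketch. This is not how MOAB is implemented; it only reproduces the arithmetic from the text. The function names, the job names, and the assumption that jobs become eligible in submission order are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

ELIGIBLE_LIMIT = 20        # per-user cap on eligible idle jobs (from the text above)
FIRST_DEFER_MINUTES = 10   # length of the first deferral (from the text above)


def split_idle_jobs(job_ids: List[str],
                    limit: int = ELIGIBLE_LIMIT) -> Tuple[List[str], List[str]]:
    """Split idle jobs into (eligible, blocked).

    Assumption: jobs are considered in submission order, so the first
    `limit` jobs are eligible and the rest are blocked.
    """
    return job_ids[:limit], job_ids[limit:]


def defer_schedule(attempts: int,
                   first: int = FIRST_DEFER_MINUTES) -> List[int]:
    """Return the defer time in minutes for each successive failed match.

    The defer time starts at `first` minutes and doubles every time
    the job is deferred again.
    """
    return [first * 2 ** i for i in range(attempts)]


if __name__ == "__main__":
    jobs = [f"job{n}" for n in range(1, 26)]    # 25 hypothetical idle jobs
    eligible, blocked = split_idle_jobs(jobs)
    print(len(eligible), len(blocked))           # 20 5
    print(defer_schedule(4))                     # [10, 20, 40, 80]
```

Under these assumptions, a user with 25 idle jobs would see 20 eligible and 5 blocked, and a job that fails to match four times in a row would be deferred for 10, 20, 40, and then 80 minutes.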