Condor Documentation

Introduction

Condor is a high-throughput, opportunistic distributed computing environment. It is specifically tailored to long-running and/or batch serial jobs where efficient use of computing resources is preferred to speed of computation. For example, one might use Condor instead of existing high-performance computing solutions to run numerous serial jobs simultaneously that are known to take days, weeks, or months to complete.

Job Submission

Condor jobs may be submitted from any HPC head node or from the Condor login node, condor-login.hpc.fsu.edu. Any HPC user may log in to the Condor login node from any of the HPC login nodes by typing "ssh condor-login". condor-login.hpc.fsu.edu is also accessible directly from any machine on the FSU VPN.

The spool directory (/opt/condor/spool) on Condor submit nodes holds the job queue and history files for all jobs submitted from a given machine. As a result, disk space requirements limit the number of jobs that can be submitted from our HPC login nodes, especially if users are submitting jobs with very large executables or image sizes.

The login nodes currently have about 50 GB of space available for this spool directory. condor-login.hpc.fsu.edu has 1.5 TB of space for the spool directory and can accommodate users that need the additional space.

condor-login.hpc.fsu.edu has access to Lustre ONLY. Users will need to copy needed files from Panfs to Lustre when using the condor submit node.

If files submitted with a Condor job are too slow to pull from Lustre on condor-login, users may stage data on condor-login in /opt/condor/stage.

All condor utilities should be immediately available upon login (ensure you have not overwritten PATH). Currently, all compute nodes are backed by clusters and workstations at the Department of Scientific Computing and many are available for general access.

Your home directory on condor-login is on Lustre. The Panfs file system is not available on the condor-login node, so copy any files that you need from Panfs to Lustre before you leave your HPC login node.

If you experience performance issues submitting jobs from the condor-login node, try moving your large data files from Lustre to /opt/condor/stage on the local condor-login node. Staging large files on the local condor-login disks reduces network traffic and may speed up your runs if you are working with very large data files. Please delete or move your files from this directory when you are done.

Job parameters and requirements are described by a Condor description file and submitted using the condor_submit utility (again available on every HPC head node). A description may specify everything from the required or supported architecture, operating system, and RAM to the program environment and arguments. A single description may even include multiple jobs.

Here is a simple example of a condor job description:

universe = vanilla
executable = /path/to/my/executable
arguments = arg1 arg2 arg3
transfer_input_files = file1, file2, file3
requirements = Arch == "X86_64" && OpSys == "LINUX"
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
queue

This describes a vanilla job that will run on a 64-bit Intel compute node running Linux. A vanilla job runs on a compute node with no special features. Other job universes provide special features such as checkpointing, but presently only vanilla is supported.
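As a rough illustration of a description that carries several jobs along with RAM and environment settings, the sketch below queues five copies of the same executable. It assumes the executable accepts a run index as its only argument. The output, error, and log file names, the RUN_LABEL variable, and the 2048 MB memory floor are hypothetical values chosen for the example, not site settings.

```condor
# Sketch only: five vanilla jobs from one description.
# $(Process) expands to 0, 1, 2, 3, 4, one value per queued job.
universe = vanilla
executable = /path/to/my/executable
arguments = $(Process)
# Hypothetical environment variable passed to every job.
environment = "RUN_LABEL=test"
# Memory is the machine's RAM in megabytes; 2048 is an assumed threshold.
requirements = Arch == "X86_64" && OpSys == "LINUX" && Memory >= 2048
# Hypothetical per-job output and error files, plus one shared log.
output = job.$(Process).out
error = job.$(Process).err
log = job.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
queue 5
```

Submitting this file once with condor_submit should place five jobs in the queue under a single cluster ID, each writing to its own output and error file.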

You may view all queued jobs with condor_q. Likewise, you may query jobs by user name or job ID (output from condor_submit) with condor_q user|jobid.

Compute node statuses may also be queried with condor_status. A compute node may either be UNCLAIMED, CLAIMED, BUSY, or OWNER. Nodes that are UNCLAIMED are available to run jobs contingent on the maximum number of allowed running jobs.

For more documentation on condor job submission and control, please see the condor documentation at DSC:

http://www.sc.fsu.edu/computing/general-access/batch

Remember that you do not have to log in to submit.scs.fsu.edu; you can submit jobs from your HPC login node!

Caveats

- Standard Universe is not supported.

- To take full advantage of the compute nodes, executables need to be compiled for 3 different architectures (32-bit and 64-bit Intel, and 64-bit PowerPC).

- Compute nodes do not presently share the same file system as HPC.

- Condor will notify you of job completion by email. If this is not desired, add the following to your description:

notification = NEVER
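
To make the multi-architecture and notification caveats concrete, here is a minimal sketch of one description that can match any of the three architectures. It assumes three pre-built binaries named myprog.INTEL, myprog.X86_64, and myprog.PPC64 (hypothetical names) in the submit directory. The Arch string reported by the PowerPC nodes in this pool is also an assumption; condor_status shows the actual values.

```condor
# Sketch only: one description, three possible architectures.
# $$(Arch) is replaced with the Arch value of the machine the job
# is matched to, so the matching binary is the one sent to the node.
universe = vanilla
executable = myprog.$$(Arch)
# "PPC64" is an assumed Arch string; confirm it with condor_status.
requirements = (Arch == "INTEL" || Arch == "X86_64" || Arch == "PPC64") && OpSys == "LINUX"
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# Suppress the completion email described above.
notification = NEVER
queue
```

Because the compute nodes do not share a file system with HPC, file transfer is left enabled so the executable and any output travel with the job.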

Attachments

- Condor_DSC.pdf: Intro to Condor (1053 Kb)
- condor_example3.tar.gz: Condor Presentation - Example 3 (2 Kb)