Matlab Distributed Computing Engine (DCE)

Introduction

The Matlab Distributed Computing Engine (DCE) allows users to submit two types of jobs: simple distributed (i.e., multiple single-CPU jobs) and complex parallel (i.e., mpi or Matlab parallel). For a full reference, please see the MathWorks documentation.

HPC Matlab Resources

There is a 16-seat license for the DCE, so a Matlab user can run up to 16 tasks (parallel or simple) when all seats are free. Each headnode has a 2-seat license for the interactive Matlab client, which is used to submit jobs to the DCE through the shared filesystem.

**IMPORTANT NOTE**

A common problem users encounter is a misunderstanding of the Matlab search path. When you run a Matlab function using the DCE, the function must be somewhere Matlab can find it. Matlab adds YOURHOMEDIR/matlab to the search path by default, but not the subdirectories under it. To add other directories and subdirectories to the search path, create a startup.m file in the ~/matlab directory and add paths to the file with:

addpath ~/SOMEWHERE/IMPORTANT

Or to include subdirectories under the path:

addpath(genpath('/SOMEWHERE/IMPORTANT/'))

The error that is returned from the submission if your function can't be found is:

Undefined function or method 'YOURFUNCTION' for input arguments of type 'double'
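
As a minimal sketch of such a startup.m, the following adds a code directory and its subdirectories only if the directory exists. The path ~/SOMEWHERE/IMPORTANT is the placeholder used above, and the variable name codeDir is made up for illustration.

```matlab
% ~/matlab/startup.m (sketch; SOMEWHERE/IMPORTANT is a placeholder path)
codeDir = fullfile(getenv('HOME'), 'SOMEWHERE', 'IMPORTANT');

if exist(codeDir, 'dir')
    % Add the directory and every subdirectory under it
    addpath(genpath(codeDir));
else
    warning('startup:missingDir', 'Directory not found: %s', codeDir);
end

% To check that a function is visible before submitting a job, run
%   which YOURFUNCTION
% at the Matlab prompt; it prints the file that would be used.
```

Guarding the addpath call with an existence check keeps a mistyped path from going unnoticed until a job fails on the cluster.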

Cluster Matlab Job Submission

To submit either a simple or complex job to the FSU Matlab DCE, you call the same function:

fsuClusterMatlab

This function is a customized wrapper that sets up the Matlab environment based on a number of arguments (at the Matlab prompt, type "help fsuClusterMatlab"):

>> help fsuClusterMatlab
  Setup jobs to run on the FSU cluster Matlab Distributed 
  Computing Engine (DCE) FSUCLUSTERMATLAB(outputdir, 
  moabopts, jobtype, waitforresults, numworkers, jobfunc, ...
  jobfuncargs) returns the results of the job as a cell and 
  takes optional arguments of:
 
    outputdir       = Directory to store output and intermediate 
                      files. Defaults to current directory.
                      E.g., '~/matlab/jobs' format allowed.
    moabopts        = Options to pass to moab, i.e., walltime 
                      (defaults to 10 days), queue, qos, etc.
                      E.g., '-l walltime=4:00:00, -l qos=coaps_high' 
    jobtype         = (s)imple, (p)ool parallel, (m)pi parallel
                      Defaults to (s)imple (multiple separate jobs)
                      (p)ool parallel is the matlabpool that allows the 
                      use of parfor loops (m)pi parallel can harness 
                      the labindex/labnumber functionality
    waitforresults  = (w)ait or (n)owait
                      Defaults to waitForState which blocks execution in
                      the client session until the job finishes (results
                      are returned) or fails.
    numworkers      = Number of separate matlab workers to
                      start (cpus). Defaults to 1.
    jobfunc         = The matlab function to pass to the DCE.
    jobfuncargs     = The (optional) input arguments to your jobfunc.
 
  The Matlab session must be invoked with the jvm to work correctly.

Job Types Demystified (sort of)

There are three distinct job types, and you must choose one when you submit a job to the cluster.

  1. (s)imple
    • The simple case is straightforward: you submit any number of separate 1-cpu jobs.
  2. (p)ool parallel
    • The first parallel job type uses the Matlabpool universe, which gives your code access to distributed parfor functionality.
  3. (m)pi parallel
    • The second parallel job type uses the Matlab parallel universe, which gives your code access to parallel distribution using labindex/numlabs functionality.

The two parallel job types and their functionality cannot be intermixed. If your code uses both parfor loops and labindex calls and you submit it as a pool parallel job, the code will never exit the function and will stop only when the scheduler kills it.

Matlabpool Parallel versus MPI Parallel

The Matlabpool can be viewed as a higher level parallel environment: you do not control the message passing, because Matlab handles it automatically inside its parallel functions. Only one copy of your function is run, on a "client" session that takes care of everything for you in the background.
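
As an illustration of the pool style, a job function only needs an ordinary parfor loop. This is a sketch, and the function name poolSumSquares is hypothetical, not something installed on the cluster.

```matlab
function total = poolSumSquares(n)
% Hypothetical job function for a (p)ool parallel job.
% Sums k^2 for k = 1..n; Matlab decides how iterations are split
% across the pool workers.
total = 0;
parfor k = 1:n
    total = total + k^2;   % reduction variable, safe inside parfor
end
end
```

Saved as ~/matlab/poolSumSquares.m, a function like this would be passed as the jobfunc argument with jobtype 'p'. The loop iterations must be independent of one another for parfor to accept them.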

The mpi parallel job type is a lower level parallel environment: you control the message passing. Each worker (lab) receives its own copy of your function, and you determine how your code is parallelized based on the number of workers and each worker's labindex.
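
By contrast, a function written for the mpi style divides the work itself. The sketch below assumes the Parallel Computing Toolbox functions labindex and numlabs are available; the function name mpiSumSquares is hypothetical.

```matlab
function partial = mpiSumSquares(n)
% Hypothetical job function for an (m)pi parallel job.
% Every lab runs this same function and picks its own strided slice
% of 1..n, so the slices together cover each index exactly once.
myIdx = labindex:numlabs:n;    % e.g. lab 2 of 4 gets 2, 6, 10, ...
partial = sum(myIdx.^2);       % this lab's share of the total
end
```

Each lab returns only its partial sum, so the caller has to add the partial results together. How the per-lab outputs are arranged in the returned cell is not covered here and is worth checking with a small test job.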

One set of functions can be used in both parallel paradigms: the distributed array functions. From the tests I have run, however, the results are vague and misleading as to how to use these functions correctly in the matlabpool and mpi parallel environments. Both environments will distribute arrays. In the matlabpool environment you have no access to the local parts of the arrays, and simple tests often fail to return any results. In the mpi parallel environment you have access to each local part of the array based on the worker's labindex, but if you attempt to gather the array, each worker gets a full copy of it. The best advice is to test your function in each environment to see how the code behaves.


Prototyping Your Parallel Functions

Submitting an untested function to the cluster can mean waiting in the queue only to have the job return with errors. To avoid this, you can prototype your code on your desktop (if you have purchased the Parallel Computing Toolbox) or on the login nodes with up to four workers, either interactively or non-interactively. To open an interactive window with 2 to 4 workers, type the following at the Matlab command prompt:

pmode start local 4

To start 2 - 4 background workers on the login node, type the following at the Matlab command prompt:

matlabpool local 4

Please note that prototyping has its limits: not everything you do in matlabpool or pmode with the Parallel Computing Toolbox produces the same result on the cluster (although it should). In pmode you can use both parfor and labindex functions together, but there is currently no way to do this when submitting Matlab jobs to the cluster.
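
A short non-interactive prototyping session might look like the sketch below. It assumes a Matlab release that still provides the matlabpool command used above (later releases replaced it with parpool), and the variable names are illustrative.

```matlab
% Prototype a parfor loop on the login node with four local workers
matlabpool local 4

n = 1000;
squares = zeros(1, n);
parfor k = 1:n
    squares(k) = k^2;      % each iteration writes its own element
end
total = sum(squares);

% Release the workers so the seats are free for other users
matlabpool close
```

Closing the pool when you are done matters on shared login nodes, since the local workers keep running until they are released.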


Example Job Submission

The fsuClusterMatlab function can be called with any number of arguments, but to accept the default for an argument you must pass an empty vector ([]) as a placeholder. It is beneficial to pass a separate directory for job output because the DCE creates multiple '.mat' files and a job subdirectory that holds task state, logs, and output. Please ensure that the function you are sending into fsuClusterMatlab is in your Matlab path. The simplest place to put your function is the '~/matlab' directory, which is automatically included in your Matlab path.

If your submission fails or the results are unexpected, read the log file. Matlab creates a directory structure of input, output, and log files under a job number directory, e.g.:

cd ~/matlab/jobs/Job22/
cat Job22.Task1.log

License checkout failed.
License Manager Error -4
Maximum number of users for MATLAB_Distrib_Comp_Engine reached.
Try again later.
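
When a job has many tasks, checking each log by hand gets tedious. The sketch below scans every log file in a job directory for the license message shown above. The function name findLicenseFailures is hypothetical, and the '*.log' pattern is an assumption based on the Job22.Task1.log file name.

```matlab
function failed = findLicenseFailures(jobdir)
% Return the names of log files in jobdir that contain the
% "License checkout failed" message.
logs = dir(fullfile(jobdir, '*.log'));
failed = {};
for k = 1:numel(logs)
    txt = fileread(fullfile(jobdir, logs(k).name));
    if ~isempty(strfind(txt, 'License checkout failed'))
        failed{end+1} = logs(k).name; %#ok<AGROW>
    end
end
end
```

For example, calling it with the full path to the Job22 directory would list the tasks that could not get a DCE seat and should be resubmitted later.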

To pass arguments into your function, there are three choices:

  1. results = fsuClusterMatlab( ..., @jobfunction, 1, 2, 3 )
    • If this is a single simple job or a parallel job, the arguments will be passed in once
    • If this is multiple simple jobs (numworkers > 1), the same arguments will be passed to each task. E.g., if I request 5 workers, '1,2,3' will be passed five times to five separate jobfunctions.
  2. results = fsuClusterMatlab( ..., @jobfunction, {1,2,3} )
    • Passes a Matlab cell to your job function, same conditions as the previous example.
  3. results = fsuClusterMatlab( ..., @jobfunction, {{1,2,3} {4,5,6} {7,8,9}} )
    • Three separate inputs will be passed to three simple tasks
    • Fails with a parallel job (only one task is created with a parallel job)
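
For the third form, the nested cell does not have to be typed by hand. The following sketch builds one argument set per simple task from two parameter vectors; the variable names sizes, steps, and taskargs are illustrative only.

```matlab
% Build one argument set per simple task
sizes = [100 200 300];
steps = [5 10 15];

taskargs = cell(1, numel(sizes));
for k = 1:numel(sizes)
    taskargs{k} = {sizes(k), steps(k)};   % arguments for task k
end

% taskargs now has the same layout as {{100,5} {200,10} {300,15}}
% and can be passed as the last argument to fsuClusterMatlab, with
% numworkers set to numel(taskargs).
```

Deriving numworkers from numel(taskargs) keeps the number of tasks and the number of argument sets in step when the parameter lists change.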

Examples of function use: fsuClusterMatlab(outputdir, moabopts, jobtype, waitforresults, numworkers, jobfunc, jobfuncargs)

results = fsuClusterMatlab()
  • A simple job with one worker (cpu) and you will be prompted for a job function to pass in
  • Writes job files to current directory
  • Walltime of one hour
  • Waits in the Matlab client for results to be returned
results = fsuClusterMatlab('~/matlab/clustertest',[],'p','n',10,@colsum)
  • A Matlabpool Parallel job with ten workers
  • Writes job files in the user's home ~/matlab/clustertest
  • Walltime of one hour ( Note the blank placeholder [] )
  • Does not wait for results
  • The job function '@colsum' with no arguments is passed in
results = fsuClusterMatlab([],'-l walltime=4:00:00','s',[],3,@parMagic,{{100,5} {200,10} {300,15}})
  • A simple job with three workers
  • Writes job files to current directory
  • Walltime of four hours
  • Waits for results
  • The job function '@parMagic' with three separate arguments passed in (one set for each task)
results = fsuClusterMatlab('~/matlab/jobs',[],'m','n',16,@bigarrays,rand(1))
  • An mpi parallel job with 16 workers (max seats if available)
  • Writes job files to home ~/matlab/jobs directory
  • Default Walltime of one hour
  • Does not wait for results
  • The job function '@bigarrays' with one random number argument passed in

**NOTE** Waiting for results locks up your current Matlab client session. If you choose not to wait, you have to retrieve the results manually: go to the directory you submitted as your output directory, change into the "JobNumber" directory, and load the TaskNumber.out.mat file. For example (from the Matlab prompt):

cd ~/matlab/jobs/Job141
load Task1.out.mat
whos
Name         Size            Bytes  Class    Attributes

results      1x1               912  cell

celldisp(results)

results{1} =

  18    25     2     9    16
  24     6     8    15    17
   5     7    14    21    23
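
For jobs with several tasks, the manual steps above can be wrapped in a small helper. This is a sketch that assumes every task writes a TaskN.out.mat file containing a variable named results, as in the listing above; the function name collectJobResults is hypothetical.

```matlab
function collected = collectJobResults(jobdir)
% Load the 'results' variable from every Task*.out.mat file in jobdir.
% Files come back in the order returned by dir, which is not
% guaranteed to be numeric task order (Task10 may precede Task2).
files = dir(fullfile(jobdir, 'Task*.out.mat'));
collected = cell(1, numel(files));
for k = 1:numel(files)
    s = load(fullfile(jobdir, files(k).name));   % load into a struct
    collected{k} = s.results;
end
end
```

Loading into a struct rather than straight into the workspace keeps one task's results variable from overwriting another's.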

Other Job Submission Options

While Matlab does provide the necessary scripts to submit jobs from your desktop Matlab session to a remote location, we recommend and support using Matlab on the headnodes to simplify the process. To submit to the DCE, the Matlab versions have to match (no backwards compatibility), a few files must be modified, passwordless access must be set up, firewall rules have to be modified, and separate toolboxes must be purchased (parallel computing). We cannot troubleshoot or oversee any of these issues so we recommend submitting from your login node.


Known Errors & Limitations

The log file output contains one expected error that can be ignored: the MATLAB core dump. The core dump is a debug message that will be turned off in a future release:

MATLAB core dump: Exit on fatal error (no core) enabled.

A limitation in the current Matlab parallel submissions is the inability to mix the "Matlab parallel" calls, i.e., labindex, numlabs, labBroadcast, etc., with the new automatically distributed parfor call. This limitation will hopefully be fixed in upcoming releases.

If you do image processing in Matlab and wish to parallelize the code, please note that both the interactive (pmode) and noninteractive (cluster) environments start with the "-noFigureWindows" option, so much of the printing code behaves differently. There are a few workarounds, but they do not always perform as expected.

The Matlab DCE also does not communicate over the Infiniband high speed network on the cluster. This should not impact the performance of your parallel code too much because Matlab's main parallel scheme is independent distribution (i.e., the parfor loop distribution relies on the independence of the loop variables). Possibly in future releases we can incorporate the high speed network.