| Matlab Distributed Computing Engine (DCE) |
Introduction
The Matlab Distributed Computing Engine (DCE) allows users to submit two kinds of jobs: simple distributed (i.e., multiple single-CPU jobs) and complex parallel (i.e., MPI or Matlab parallel). For a full reference, please visit Mathworks.

HPC Matlab Resources
There is a 16-seat license for the DCE, so a Matlab user can run up to 16 tasks (parallel or simple) when the seats are available. Each headnode has a 2-seat license for the interactive Matlab client, which is used to submit jobs to the DCE through the shared filesystem.

**IMPORTANT NOTE**
A common problem users encounter is a misunderstanding of the Matlab search path. When you run a Matlab function using the DCE, the function must be somewhere Matlab can find it. Matlab adds YOURHOMEDIR/matlab to the search path by default (but not the subdirectories under it). If you wish to add other directories and subdirectories to the search path, you can create a startup.m file in the ~/matlab directory and add paths to the file with:
addpath ~/SOMEWHERE/IMPORTANT
Or, to include subdirectories under the path:
addpath(genpath('/SOMEWHERE/IMPORTANT/'))
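As a minimal sketch of the startup.m approach described above, the fragment below adds a directory tree to the search path and then checks that a function can be found before it is submitted. The directory is the placeholder from the text, and `myfunc` is a hypothetical function name used only for illustration.

```matlab
% --- contents of ~/matlab/startup.m ---
% 'SOMEWHERE/IMPORTANT' is the placeholder directory from the text, not a real path.
codeDir = fullfile(getenv('HOME'), 'SOMEWHERE', 'IMPORTANT');
if exist(codeDir, 'dir') == 7
    addpath(genpath(codeDir));   % the directory and every subdirectory under it
end

% --- at the Matlab prompt, before submitting a job ---
% 'myfunc' is a hypothetical function name; exist returns 2 for a file on the path.
if exist('myfunc', 'file') ~= 2
    warning('myfunc is not on the Matlab search path.');
end
```

Running the second check before a submission is cheaper than waiting in the queue for a job that fails because the function is not on the path.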
The error that is returned from the submission if your function can't be found is:
Undefined function or method 'YOURFUNCTION' for input arguments of type 'double'

Cluster Matlab Job Submission
To submit either a simple or complex job to the FSU Matlab DCE, you call the same function:
fsuClusterMatlab
This function is a customized wrapper that sets up the Matlab environment based on a number of arguments (at the Matlab prompt, type "help fsuClusterMatlab"):
>> help fsuClusterMatlab
  Setup jobs to run on the FSU cluster Matlab Distributed
  Computing Engine (DCE)

  FSUCLUSTERMATLAB(outputdir, moabopts, jobtype, waitforresults, numworkers, jobfunc, ...
  jobfuncargs) returns the results of the job as a cell and
  takes optional arguments of:

  outputdir      = Directory to store output and intermediate
                   files. Defaults to current directory.
                   E.g., '~/matlab/jobs' format allowed.
  moabopts       = Options to pass to moab, i.e., walltime
                   (defaults to 10 days), queue, qos, etc.
                   E.g., '-l walltime=4:00:00, -l qos=coaps_high'
  jobtype        = (s)imple, (p)ool parallel, (m)pi parallel
                   Defaults to (s)imple (multiple separate jobs)
                   (p)ool parallel is the matlabpool that allows the
                   use of parfor loops
                   (m)pi parallel can harness the labindex/labnumber
                   functionality
  waitforresults = (w)ait or (n)owait
                   Defaults to waitForState which blocks execution in
                   the client session until the job finishes (results
                   are returned) or fails.
  numworkers     = Number of separate matlab workers to
                   start (cpus). Defaults to 1.
  jobfunc        = The matlab function to pass to the DCE.
  jobfuncargs    = The (optional) input arguments to your jobfunc.

  The Matlab session must be invoked with the jvm to work correctly.
Job Types Demystified (sort of)
There are three distinct job types, and you must choose one when you submit a job to the cluster: simple, pool parallel, and MPI parallel.
The two parallel job types and their functionality cannot be intermixed. If your code uses both parfor loops and labindex calls and you submit it with the pool parallel job type, the code will never exit the function and will stop only when the scheduler kills it for exceeding its time limit.

Matlabpool Parallel versus MPI Parallel
The matlabpool can be viewed as a higher-level parallel environment: you do not control the message passing, and Matlab handles it automatically through its parallel functions. Only one copy of your function is run, on a "client" session that takes care of everything for you in the background. The MPI parallel job type is a lower-level parallel environment in which you control the message passing. Each lab (worker) receives a copy of your function, and you determine how your code is parallelized based on the number of workers and their labindex values.

There is one set of functions that can be used in both parallel paradigms: the distributed functions. In the tests I have run, however, it is unclear how to use the distributed functions correctly in the matlabpool and MPI parallel environments, and the results can be misleading. Both environments will distribute arrays, but in the matlabpool environment you have no access to the local parts of the arrays, and simple tests often fail to return any results. In the MPI parallel environment, you have access to each local array part based on the labnum of the worker, but if you attempt to gather the array, each worker gets a full copy of the array. The best advice is to test your function in each environment to see how the code behaves.

Prototyping Your Parallel Functions
Rather than submitting new functions to the cluster and waiting in the queue only for the job to return with errors, you can prototype your code on your desktop (if you purchased the Parallel Computing Toolbox) or on the login nodes with up to four workers, interactively or non-interactively.
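To make the two parallel styles concrete, here is a pair of small prototype functions, one written for each job type. Both function names are hypothetical, and the work (summing squares) is only a stand-in for real code. The MPI-style function uses `labindex` and `numlabs`, the Parallel Computing Toolbox calls that report a worker's index and the total number of workers.

```matlab
% --- file: sumSquaresPool.m (pool parallel style) ---
% Uses parfor only; Matlab decides how the iterations are distributed.
function total = sumSquaresPool(n)
    total = 0;
    parfor k = 1:n
        total = total + k^2;   % reduction variable, combined across workers
    end
end

% --- file: sumSquaresMpi.m (MPI parallel style) ---
% Every lab runs its own copy; labindex/numlabs decide which slice it handles.
% Requires the Parallel Computing Toolbox. Do NOT add parfor here.
function localTotal = sumSquaresMpi(n)
    k = labindex:numlabs:n;    % this lab's share of 1..n
    localTotal = sum(k.^2);    % each lab returns only its partial sum
end
```

With the pool style a single result comes back, while with the MPI style each worker returns its own partial result, so the caller has to combine them. Keeping each function strictly in one style avoids the hang described above.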
To open an interactive window with 2 to 4 workers, type the following at the Matlab command prompt:
pmode start local 4
To start 2 to 4 background workers on the login node, type the following at the Matlab command prompt:
matlabpool local 4
Please note that prototyping has its limits, because not everything you do in matlabpool or pmode using the Parallel Computing Toolbox gives the same result on the cluster (although it should). With pmode, you can use both parfor and labindex functions (although there is currently no way to do this when submitting Matlab jobs to the cluster).

Example Job Submission
The fsuClusterMatlab function can be called with any number of arguments, but if you want the default value for an argument, you must leave an empty vector as a placeholder. It is beneficial to pass a separate directory for job output because the DCE creates multiple '.mat' files and a job subdirectory that holds task state, logs, and output. Please ensure that the function you are sending into fsuClusterMatlab is in your Matlab path. The simplest place to put your function is in the '~/matlab' directory (automatically included in your Matlab path).

If your submission fails or the results are unexpected, read the log file. Matlab creates a directory structure of input, output, and log files under a job number directory, e.g.:
cd ~/matlab/jobs/Job22/
cat Job22.Task1.log
License checkout failed.
License Manager Error -4
Maximum number of users for MATLAB_Distrib_Comp_Engine reached.
Try again later.

To pass arguments into your function, there are three choices:
Examples of function use:
fsuClusterMatlab(outputdir, moabopts, jobtype, waitforresults, numworkers, jobfunc, jobfuncargs)
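As a sketch of the placeholder rule, the call below keeps the defaults for moabopts and waitforresults by passing empty vectors in their positions. It only runs on the headnodes, where fsuClusterMatlab is installed. The function name `myfunc` is hypothetical, and passing it as a function handle is an assumption; check "help fsuClusterMatlab" for the form jobfunc actually expects.

```matlab
% Sketch only: requires the site-specific fsuClusterMatlab wrapper.
outputdir  = '~/matlab/jobs';   % separate directory for job files
numworkers = 4;                 % limited by the 16-seat DCE license
jobfunc    = @myfunc;           % ASSUMPTION: hypothetical function, passed as a handle

% [] keeps the default for moabopts (2nd) and waitforresults (4th);
% 'p' selects the pool parallel job type.
results = fsuClusterMatlab(outputdir, [], 'p', [], numworkers, jobfunc);

celldisp(results)               % the help text says results are returned as a cell
```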
**NOTE** If you choose not to wait for your results (waiting locks up your current Matlab client session), you must retrieve them manually: go to the directory you submitted as your output directory, change into the "JobNumber" directory, and load the TaskNumber.out.mat file. For example (from the Matlab prompt):
cd ~/matlab/jobs/Job141
load Task1.out.mat
whos
Name Size Bytes Class Attributes
results 1x1 912 cell
celldisp(results)
results{1} =
18 25 2 9 16
24 6 8 15 17
5 7 14 21 23
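When a job has several tasks, loading each file by hand becomes tedious. The loop below is a minimal sketch that collects every task's output from one job directory. It assumes the file naming shown above (TaskN.out.mat) and that each file stores a variable named `results`, as in the example; the job directory is the one from the example and should be changed to your own.

```matlab
% Job directory from the example above; replace Job141 with your own job number.
jobDir = fullfile(getenv('HOME'), 'matlab', 'jobs', 'Job141');

% Find every task output file in the job directory.
files = dir(fullfile(jobDir, 'Task*.out.mat'));

allResults = cell(1, numel(files));
for k = 1:numel(files)
    % Load into a struct so nothing in the workspace is overwritten.
    s = load(fullfile(jobDir, files(k).name));
    if isfield(s, 'results')            % ASSUMPTION: variable name from the example
        allResults{k} = s.results;
    end
    fprintf('Loaded %s\n', files(k).name);
end
```

Note that dir lists names alphabetically, so Task10 sorts before Task2; use the printed file names to match each entry to its task.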
Other Job Submission Options
While Matlab does provide the necessary scripts to submit jobs from your desktop Matlab session to a remote location, we recommend and support using Matlab on the headnodes to simplify the process. To submit to the DCE from a desktop, the Matlab versions have to match (there is no backwards compatibility), a few files must be modified, passwordless access must be set up, firewall rules have to be modified, and a separate toolbox must be purchased (Parallel Computing Toolbox). We cannot troubleshoot or oversee any of these issues, so we recommend submitting from your login node.

Known Errors & Limitations
In the log file output there is an expected error that can be ignored, the MATLAB core dump. The core dump is a debug message that will be turned off in a future release:
MATLAB core dump: Exit on fatal error (no core) enabled.

A limitation in the current Matlab parallel submissions is the inability to mix the "Matlab parallel" calls, i.e., labindex, labnum, labBroadcast, etc., with the new automatically distributed parfor call. This limitation will hopefully be fixed in upcoming releases.

If you do image processing in Matlab and wish to parallelize the code, please note that both the interactive (pmode) and noninteractive (cluster) environments start with the "-noFigureWindows" option, so much of the printing code behaves differently. There are a few workarounds, but they do not always perform as expected.

The Matlab DCE also does not communicate over the Infiniband high-speed network on the cluster. This should not impact the performance of your parallel code too much, because Matlab's main parallel scheme is independent distribution (i.e., the parfor loop distribution relies on the independence of the loop iterations). We may be able to incorporate the high-speed network in future releases. |



