NAME
HPCD::SLURM::Run
SYNOPSIS
use HPCD::SLURM::Run;
DESCRIPTION
This module runs the SLURM commands srun, sacct and scancel on behalf of a stage.
srun: This module assembles the stage attributes supplied by the user into an srun command
and submits it for the system to run.
sacct: This module contains a method to parse the accounting information returned
by the sacct command.
scancel: This module executes the scancel command when the job needs to be killed.
ATTRIBUTE
stats_keys
stats_keys records the names of the accounting information keys. Defaults to @job_info_fields.
METHODS
after '_collect_job_stats'
Calls the subroutine _collect_sacct_info and updates the hash in the stats attribute
with the exit status and accounting info.
_collect_sacct_info
Executes the command 'sacct --name $id --format=$key' n times, where $key is a single key
returned by sacct -e and n is the number of keys returned by sacct -e. Stores each $key and the
accounting value returned for it in the hash %info, and returns %info as the result.
sacct is called once per key rather than once for all keys (sacct --name $id
--format=$key1,$key2,$key3...) because a value field may be blank. For example, the output of
the command sacct --name $id --format=Account,User,Comment,ReqMem might be
   Account      User    Comment     ReqMem
---------- --------- ---------- ----------
                jdoe                   2Gn
With blank fields in the output, it is difficult to split the values and match them to their
keys. Requesting --format=$key separately for each $key makes every blank value detectable and
ensures that the key-value pairs are matched correctly.
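The per-key approach described above can be sketched as follows. This is a minimal
illustration, not the module's actual code: collect_sacct_info and the $run code reference
(which stands in for however sacct is really executed) are hypothetical names, and the
sketch assumes that each sacct call prints a header line, a line of dashes, and then the
record.

```perl
use strict;
use warnings;

# Hypothetical sketch: query one key at a time so a blank value cannot be
# confused with a neighbouring column.
# $run is an assumed code reference: it takes a command string and returns
# the output lines of that command.
sub collect_sacct_info {
    my ( $id, $keys, $run ) = @_;
    my %info;
    for my $key (@$keys) {
        my @lines = $run->("sacct --name $id --format=$key");

        # Assumed layout: line 0 is the header, line 1 the dashes,
        # line 2 the first record (possibly only whitespace).
        my $value = defined $lines[2] ? $lines[2] : '';
        $value =~ s/^\s+|\s+$//g;    # trim padding; a blank field becomes ''
        $info{$key} = $value;
    }
    return %info;
}

# Canned output used in place of a real sacct call.
my %fake_output = (
    User    => [ '     User', '---------', '     jdoe' ],
    Comment => [ '   Comment', '----------', '          ' ],
);
my $run = sub {
    my ($key) = $_[0] =~ /--format=(\w+)/;
    return @{ $fake_output{$key} };
};

my %info = collect_sacct_info( 'NAME12345', [ 'User', 'Comment' ], $run );
print "$_ => '$info{$_}'\n" for sort keys %info;
```

Because each call returns exactly one column, a blank value shows up as an empty string
for its own key instead of shifting the remaining values.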
around 'soft_timeout'
Replaces the original 'soft_timeout' method in HPCI, and cancels the job directly.
around 'hard_timeout'
Cancels the job before calling the original HPCI method 'hard_timeout'.
The original hard_timeout sends a kill signal to the child process. Here the child
process is the "srun" program, not the actual job, which runs on another computer
and so cannot be reached with kill. Sleeping and then sending the kill signal
anyway at least cleans up the local process if the cancellation does not work
properly. Usually the cancellation works, and the kill signal goes to a process that
has already terminated.
_delete_job
Terminates the job by calling scancel -n $id.
_to_MB
A subroutine which converts a memory value with a unit suffix of K, M, G or T to a
number in MB, since the srun --mem= option takes a plain number that is interpreted
as MB by default.
Example:
$self->_to_MB('2G') would return 2048
$self->_to_MB('100M') would return 100
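A conversion of this kind might look like the sketch below. It is an illustration only:
the function name to_mb, the use of binary multiples (1G = 1024M), rounding up for K
values, and treating a value without a suffix as MB are all assumptions, not behaviour
confirmed for _to_MB.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Factors that convert each unit suffix to MB (binary multiples assumed).
my %TO_MB = ( K => 1 / 1024, M => 1, G => 1024, T => 1024 * 1024 );

# Hypothetical stand-in for _to_MB.
sub to_mb {
    my ($value) = @_;
    my ( $number, $unit ) = $value =~ /^(\d+(?:\.\d+)?)([KMGT])?$/i
        or die "Unrecognised memory value: $value\n";
    $unit = defined $unit ? uc $unit : 'M';    # assumption: no suffix means MB
    return ceil( $number * $TO_MB{$unit} );    # assumption: round up to whole MB
}

print to_mb('2G'),   "\n";    # 2048
print to_mb('100M'), "\n";    # 100
```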
_reformat_time
A subroutine which reformats the input $sec (a number of seconds) to one of
minute:second, hour:minute:second, or day-hour:minute:second, which are
formats accepted by the srun --time= option.
Example:
$self->_reformat_time(1) would give '0:1'
$self->_reformat_time(70) would give '1:10'
$self->_reformat_time(3601) would give '1:0:1'
$self->_reformat_time(86400) would give '1-0:0:0'
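The examples above are consistent with the following sketch, which picks the shortest
format that can hold the value and does not zero-pad the fields. The function name
reformat_time is hypothetical, and the real _reformat_time may be written differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _reformat_time: split a number of seconds into
# days, hours, minutes and seconds, then emit the shortest matching format.
sub reformat_time {
    my ($sec) = @_;
    my $days  = int( $sec / 86400 );
    my $hours = int( ( $sec % 86400 ) / 3600 );
    my $mins  = int( ( $sec % 3600 ) / 60 );
    my $secs  = $sec % 60;

    return "$days-$hours:$mins:$secs" if $days;     # day-hour:minute:second
    return "$hours:$mins:$secs"       if $hours;    # hour:minute:second
    return "$mins:$secs";                           # minute:second
}

print reformat_time($_), "\n" for ( 1, 70, 3601, 86400 );
```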
_res_value_map
A subroutine which reformats a key and its value from the stage attribute
resources_required into the option format accepted by srun.
Example:
If the key and value in resources_required is 'mem' and '3G', then
_res_value_map would give '--mem=3072' as the output.
_get_mapped_resources_string
A subroutine which maps parameters in resources_required to a string of srun
options.
Example:
Say resources_required is {"mem" => "100M", "h_time" => 1000}, then the output
will be '--mem=100 --time=16:40'.
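One way to picture how the per-key mapping and the joined option string fit together is
the sketch below. Everything in it is illustrative: the lookup table %RES_MAP, the fixed
ordering in @RES_ORDER, and the helper names are assumptions made for the example, and
only the two resource keys shown in this document (mem and h_time) are covered.

```perl
use strict;
use warnings;

# Simplified stand-ins for _to_MB and _reformat_time (assumed behaviour).
sub to_mb {
    my ($value) = @_;
    my %factor = ( M => 1, G => 1024, T => 1024 * 1024 );
    my ( $n, $unit ) = $value =~ /^(\d+)([MGT])$/
        or die "Unrecognised memory value: $value\n";
    return $n * $factor{$unit};
}

sub reformat_time {
    my ($sec) = @_;
    my ( $d, $h ) = ( int( $sec / 86400 ), int( ( $sec % 86400 ) / 3600 ) );
    my ( $m, $s ) = ( int( ( $sec % 3600 ) / 60 ), $sec % 60 );
    return $d ? "$d-$h:$m:$s" : $h ? "$h:$m:$s" : "$m:$s";
}

# Hypothetical table: resource key => [ srun option, value converter ].
my %RES_MAP = (
    mem    => [ '--mem',  \&to_mb ],
    h_time => [ '--time', \&reformat_time ],
);

# Fixed order so the output does not depend on Perl's hash ordering.
my @RES_ORDER = qw(mem h_time);

# Stand-in for _res_value_map: one key/value pair to one srun option.
sub res_value_map {
    my ( $key, $value ) = @_;
    my ( $option, $convert ) = @{ $RES_MAP{$key} };
    return $option . '=' . $convert->($value);
}

# Stand-in for _get_mapped_resources_string: all pairs to one string.
sub mapped_resources_string {
    my ($resources) = @_;
    return join ' ',
        map  { res_value_map( $_, $resources->{$_} ) }
        grep { exists $resources->{$_} } @RES_ORDER;
}

print mapped_resources_string( { mem => '100M', h_time => 1000 } ), "\n";
```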
_get_submit_command
A subroutine which combines the attributes of a given stage (i.e. shell_script,
unique name, stdout, stderr, native_args_string, resources_required) into a single
srun command for the system to execute.
Example:
If the stage has its script_file named script.sh, unique_id being NAME12345,
native_args_string being "-N 2 -n 4 --mail-type=ALL --mail-user=jdoe@xyz.com",
resources_required being {"mem" => "5G", "h_time" => 200}, then the output of this
subroutine would be 'srun -N 2 -n 4 --mail-type=ALL --mail-user=jdoe@xyz.com
--mem=5120 --time=3:20 -J NAME12345 -o someoutputpath -e someerrorpath script.sh'.
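The assembly order shown in the example can be sketched as below. This is a hypothetical
illustration: build_submit_command and the hash keys it reads are names chosen for the
example, and the resources string is passed in already mapped rather than being derived
from resources_required.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _get_submit_command. The parts are joined in the
# order used by the example: native arguments, mapped resources, job name,
# stdout path, stderr path, script.
sub build_submit_command {
    my (%stage) = @_;
    return join ' ',
        grep { defined($_) && length($_) }    # skip attributes that are unset
        'srun',
        $stage{native_args_string},
        $stage{resources_string},
        '-J', $stage{unique_id},
        '-o', $stage{stdout},
        '-e', $stage{stderr},
        $stage{script_file};
}

# Single quotes keep Perl from interpolating the '@' in the mail address.
print build_submit_command(
    native_args_string => '-N 2 -n 4 --mail-type=ALL --mail-user=jdoe@xyz.com',
    resources_string   => '--mem=5120 --time=3:20',
    unique_id          => 'NAME12345',
    stdout             => 'someoutputpath',
    stderr             => 'someerrorpath',
    script_file        => 'script.sh',
), "\n";
```

Building the command as one string matches the description above; a caller that executes
it without a shell would more likely keep the parts as a list.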
AUTHOR
John Macdonald - Boutros Lab
Anqi (Joyce) Yang - Boutros Lab
ACKNOWLEDGEMENTS
Paul Boutros, PhD, PI - Boutros Lab
The Ontario Institute for Cancer Research