Configuration
Bio::Grid::Run::SGE uses various configuration settings to run a job. All configuration is stored in YAML format. The configuration can be stored in two places:
- 1. In a global configuration file located at ~/.bio-grid-run-sge.conf.
- 2. In a per-job configuration file supplied as argument to the cluster script.
Creating a global config file
The global config file can contain, for example, settings for job notifications and paths to executables.
Job notification
Bio::Grid::Run::SGE can notify you by email or Jabber message when a job finishes. You can also run a custom script by setting the script option. An example configuration would be:
---
notify:
  mail:
    dest: person.in.charge@example.com
    smtp_server: smtp.example.com
  jabber:
    jid: grid-report@jabber.example.com/grid_report
    password: ...
    dest: person-in-charge@jabber.example.com
  script: /path/to/log/script.pl
Custom scripts receive a JSON-encoded structure on stdin. The structure has the form:
{
  "subject": "the subject",
  "message": "the main log message",
  "from": "user@the.cluster.org"
}
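A custom notification script could look roughly like the following sketch. It assumes only the three JSON fields shown above; the function names format_notification and handle_notification and the tab-separated output format are hypothetical and not part of Bio::Grid::Run::SGE.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: turn the JSON payload into one tab-separated log line.
sub format_notification {
    my ($json_text) = @_;
    my $n = decode_json($json_text);

    # Missing fields become empty strings so the line keeps three columns.
    return join "\t", map { defined $n->{$_} ? $n->{$_} : '' } qw(from subject message);
}

# Read the whole payload from one filehandle and write the log line to another.
sub handle_notification {
    my ( $in, $out ) = @_;
    my $payload = do { local $/; <$in> };    # slurp all input at once
    print {$out} format_notification($payload), "\n";
    return;
}
```

To use this as the script named in the script option, the file would end with a call such as handle_notification(\*STDIN, \*STDERR); taking filehandles as parameters keeps the logic easy to test without a running cluster job.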
Other global configuration
You can add other configuration settings. If you run many R scripts, you might want to add the path to the Rscript binary as a global setting:
---
notify:
  ....
r_script_bin: /usr/bin/Rscript
This configuration setting is accessible in the cluster script via the configuration supplied to the task function, as $c->{extra}{r_script_bin}.
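As an illustration, a task function could read the setting through a small helper like the one below. The helper name rscript_bin and the fallback to a bare Rscript found via PATH are assumptions made for this sketch; the hash $c merely stands in for the configuration the task function receives.

```perl
use strict;
use warnings;

# Hypothetical helper: look up the Rscript binary in the task configuration.
sub rscript_bin {
    my ($c) = @_;
    my $extra = $c->{extra} || {};

    # Assumption: fall back to 'Rscript' on PATH if the global setting is absent.
    return defined $extra->{r_script_bin} ? $extra->{r_script_bin} : 'Rscript';
}

# Stand-in for the configuration hash passed to the task function.
my $c = { extra => { r_script_bin => '/usr/bin/Rscript' } };
print rscript_bin($c), "\n";
```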
Creating a job-specific config file
job_name: NAME
mode: Consecutive/AvsB/AllvsAll/AllvsAllNoRep
args: [ '-a', 10, '-b','no' ]
test: 2
no_prompt: 1
parts: 3000
# or
combinations_per_job: 300
result_dir: result_gff
working_dir:
stderr_dir:
stdout_dir:
log_dir: dir
tmp_dir: dir
idx_dir: dir
prefix_output_dirs:
Path specification in the config file
If the config file contains relative paths, the following policy is used:
- 1. The working_dir config entry is used as "root".
- 2. If no working_dir config entry is specified, the directory of the config file is set to the working/root dir.
- 3. If no config file is specified (yes, this is possible, but not recommended), the current dir is used as working/root dir.
The working directory needs to exist.
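The policy above can be sketched as follows. This is not the module's actual implementation; resolve_root_dir and resolve_path are hypothetical helpers written only to make the three rules concrete, and the sketch assumes that relative paths are resolved against the resulting root directory.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Basename qw(dirname);
use File::Spec;

# Hypothetical helper: pick the root directory according to the three rules.
sub resolve_root_dir {
    my ( $config, $config_file ) = @_;

    # 1. An explicit, non-empty working_dir entry wins.
    return $config->{working_dir}
        if defined $config->{working_dir} && length $config->{working_dir};

    # 2. Otherwise use the directory that contains the config file.
    return dirname( File::Spec->rel2abs($config_file) )
        if defined $config_file;

    # 3. Without a config file, use the current directory.
    return getcwd();
}

# Hypothetical helper: resolve a relative path against the root;
# absolute paths are returned unchanged.
sub resolve_path {
    my ( $path, $root ) = @_;
    return File::Spec->rel2abs( $path, $root );
}
```

With a root of /data/job, a relative entry such as result_dir: result_gff would resolve to /data/job/result_gff under these assumptions.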
The input section
The input section specifies the type of input data and how the index is created.
The basic layout is:
---
input:
  - ... # details of index 1
  - ... # details of index 2
Each index element is passed as an argument to the task function:
run_job(...
  task => sub {
    my ( $c, $result_prefix, $element_index_1, $element_index_2, ... ) = @_;
  }
  ...
)
The number of indices you can use is determined by the mode. The most basic mode is Consecutive; it takes one index and iterates through every element.
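Running mode
Bio::Grid::Run::SGE can run in different iteration modes: Consecutive, AvsB, AllvsAll and AllvsAllNoRep.
Index types
The following example shows one entry for each of the index formats General, List, FileList and Range: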
---
input:
  - format: General
    #files, list and elements are synonyms
    files:
      - ../03_clean_evidence/result/merged.fa.clean
    chunk_size: 30
    sep: ^>
    sep_remove: 1
    sep_pos: '^'/'$'
    ignore_first_sep: 1
  - format: List
    list: [ 'a', 'b', 'c' ]
  - format: FileList
    files: [ 'filea', 'fileb', 'filec' ]
  - format: Range
    list: [ 'from', 'to' ]
RESERVED CONFIGURATION OPTIONS
Example configuration:
'stdout_dir' => '/WORKING_DIR/xml_munge1.tmp/out',
'test' => '1',
'no_prompt' => undef,
'input' => [
  {
    'elements' => [
      '../../2013-10-13_string_b2g_blast/cafa_b2g_blastSTRING_9606_protein.sequences.result/cafa_b2g_blastSTRING_*_protein.sequences.*.blast.gz'
    ],
    'format' => 'FileList',
    'idx_file' => '/WORKING_DIR/idx/xml_munge1.0.idx'
  }
],
'mode' => 'Consecutive',
'range' => [ '1', '1' ],
'submit_bin' => 'qsub',
'submit_params' => [],
'args' => [],
'working_dir' => '/WORKING_DIR/test',
'num_comb' => 564,
'log_dir' => '/WORKING_DIR/xml_munge1.tmp/log',
'stderr_dir' => '/WORKING_DIR/xml_munge1.tmp/err',
'tmp_dir' => '/WORKING_DIR/xml_munge1.tmp',
'smtp_server' => 'net.wur.nl',
'job_name' => 'xml_munge1',
'extra' => { 'map' => '../split_test.map.json.gz' },
'mail' => 'joachim.bargsten@wur.nl',
'script_dir' => '/WORKING_DIR/bin',
'idx_dir' => '/WORKING_DIR/idx',
'job_cmd' =>
  'qsub -t 1-1 -S perl -N xml_munge1 -e /WORKING_DIR/xml_munge1.tmp/err -o /WORKING_DIR/xml_munge1.tmp/out /WORKING_DIR/xml_munge1.tmp/env.xml_munge1.pl WORKING_DIR/bin/cl_xml_munge.pl --worker /WORKING_DIR/xml_munge1.tmp/xml_munge1.config.dat',
'job_id' => '325541.1',
'cmd' => [ '/WORKING_DIR/bin/cl_xml_munge.pl' ],
'_worker_config_file' =>
  '/WORKING_DIR/xml_munge1.tmp/xml_munge1.config.dat',
'prefix_output_dirs' => '1',
'perl_bin' => '/home/cafa/perl5/perlbrew/perls/perl-5.16.3/bin/perl',
'result_dir' => '/WORKING_DIR/xml_munge1.result',
'part_size' => 1,
'parts' => 564
Here is a list of reserved configuration options:
$c = {
cmd => ...,
script_dir => ...,
no_post_task => ...,
tmp_dir => ...,
stderr_dir => ...,
stdout_dir => ...,
result_dir => ...,
log_dir => ...,
idx_dir => ...,
test => ...,
mail => ...,
smtp_server => ...,
no_prompt => ...,
lib => ...,
input => ...,
extra => ...,
parts => ...,
combinations_per_job => ...,
job_name => ...,
job_id => ...,
mode => ...,
_worker_config_file => ...,
_worker_env_script => ...,
submit_bin => ...,
submit_params => ...,
perl_bin => ...,
working_dir => ...,
iterator => ...,
args => ...,
};
input section
---
input:
  - format: General
    #files, list and elements are synonyms
    files:
      - ../03_clean_evidence/result/merged.fa.clean
    chunk_size: 30
    sep: ^>
    sep_remove: 1
    sep_pos: '^'/'$'
    ignore_first_sep: 1
  - format: List
    list: [ 'a', 'b', 'c' ]
  - format: FileList
    files: [ 'filea', 'fileb', 'filec' ]
  - format: Range
    list: [ 'from', 'to' ]
job_name: NAME
mode: Consecutive/AvsB/AllvsAll/AllvsAllNoRep
args: [ '-a', 10, '-b','no' ]
test: 2
no_prompt: 1
parts: 3000
# or
combinations_per_job: 300
result_dir: result_gff
working_dir:
stderr_dir:
stdout_dir:
log_dir: dir
tmp_dir: dir
idx_dir: dir
prefix_output_dirs:
The attribute args is special: normally the main executable is hard-coded in the cl_* script, while the arguments change per configuration. Therefore Bio::Grid::Run::SGE::Master provides the convenience attribute $c->{args}.
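A minimal sketch of how a cl_* script might combine a hard-coded executable with the configured arguments is shown below. The helper build_command, the executable name my_tool and the input file chunk.fa are hypothetical; only the args attribute itself comes from the configuration described here.

```perl
use strict;
use warnings;

# Hypothetical helper: combine a hard-coded executable with per-job arguments.
sub build_command {
    my ( $executable, $c, @inputs ) = @_;
    my @args = @{ $c->{args} || [] };    # args may be absent or empty
    return ( $executable, @args, @inputs );
}

# Stand-in for the task configuration, using the args from the example config.
my $c = { args => [ '-a', 10, '-b', 'no' ] };
my @cmd = build_command( 'my_tool', $c, 'chunk.fa' );
print "@cmd\n";
```

Keeping the command as a list allows it to be passed to system() in list form, which avoids shell quoting problems for arguments that contain spaces.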