Configuration

Bio::Grid::Run::SGE uses various configuration settings to run a job. All configuration is stored in YAML format, in one of two places:

1. In a global configuration file located at ~/.bio-grid-run-sge.conf.
2. In a per-job configuration file supplied as an argument to the cluster script.

Creating a global config file

The global config file can contain, for example, settings for job notifications and paths to executables.

Job notification

Bio::Grid::Run::SGE can notify you by email or Jabber message when a job finishes. You can also have it call a custom script via the script option. An example configuration would be:

---
notify:
  mail:
    dest: person.in.charge@example.com
    smtp_server: smtp.example.com
  jabber:
    jid: grid-report@jabber.example.com/grid_report
    password: ...
    dest: person-in-charge@jabber.example.com
  script: /path/to/log/script.pl

Custom scripts receive a JSON-encoded structure via stdin. The structure has the form:

{
  "subject": "the subject",
  "message": "the main log message",
  "from":    "user@the.cluster.org"
}
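
As a minimal sketch of such a custom script, the following Perl code decodes the structure and turns it into a single log line. The function name format_notification and the log line layout are assumptions made for illustration; only the three keys (subject, message, from) come from the structure above. The demo reads from an in-memory filehandle so that it runs on its own; a real script would pass \*STDIN instead.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP qw(decode_json);    # core module since Perl 5.14

# Read one JSON document from a filehandle and return a one-line log entry.
# (format_notification is a hypothetical helper, not part of Bio::Grid::Run::SGE.)
sub format_notification {
    my ($fh) = @_;
    my $json_text = do { local $/; <$fh> };    # slurp all input
    my $msg = decode_json($json_text);
    return sprintf "[%s] %s: %s", $msg->{from}, $msg->{subject}, $msg->{message};
}

# Demo input with the same shape as the structure shown above.
my $sample = '{"subject":"the subject","message":"the main log message","from":"user@the.cluster.org"}';
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $line = format_notification($fh);    # a real script would pass \*STDIN
close $fh;
print "$line\n";
```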

Other global configuration

You can add other configuration settings. If you start a lot of R scripts, you might want to add the path of the Rscript binary to the global configuration:

---
notify:
  ....
r_script_bin: /usr/bin/Rscript

This setting is accessible in the cluster script through the configuration passed to the task function, as $c->{extra}{r_script_bin}.
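
A rough sketch of how a task function might read this setting is shown here. The $c hash is a hand-built stand-in for the configuration the task function receives, and the script name analysis.R, the file names and the fallback to a plain Rscript are assumptions. The sub only builds a command line and returns it for display; what a real task function should run or return is not covered by this sketch.

```perl
use strict;
use warnings;

# Stand-in for the configuration hash passed to the task function.
# Only extra/r_script_bin is taken from the global configuration above.
my $c = { extra => { r_script_bin => '/usr/bin/Rscript' } };

# Hypothetical task body: build (but do not run) an Rscript command line.
my $task = sub {
    my ( $c, $result_prefix, $input_file ) = @_;
    my $rscript = $c->{extra}{r_script_bin} // 'Rscript';    # assumed fallback: look up Rscript in PATH
    return ( $rscript, 'analysis.R', $input_file, "$result_prefix.out" );
};

my @cmd = $task->( $c, 'result/part1', 'data/part1.txt' );
print "@cmd\n";
```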

Creating a job-specific config file

job_name: NAME
mode: Consecutive # or AvsB, AllvsAll, AllvsAllNoRep

args: [ '-a', 10, '-b','no' ]
test: 2
no_prompt: 1

num_parts: 3000
# or
combinations_per_task: 300

result_dir: result_gff
working_dir:
stderr_dir:
stdout_dir:

log_dir: dir
tmp_dir: dir
idx_dir: dir

prefix_output_dirs: 

Path specification in the config file

If the config file contains relative paths, the following policy is used:

1. The working_dir config entry is used as "root".
2. If no working_dir config entry is specified, the directory containing the config file is used as the working/root dir.
3. If no config file is specified (yes, this is possible, but not recommended), the current dir is used as working/root dir.

The working directory needs to exist.
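
The policy above can be sketched in Perl as follows. This is an illustration of the three rules, not the module's actual implementation; the helper names root_dir and resolve_path and the example paths are assumptions. The sketch does not check that the working directory exists, and it leaves open how a relative working_dir entry itself would be resolved.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Basename qw(dirname);
use File::Spec;

# Pick the root dir for relative paths, following the three rules above.
# $config: hash ref of the parsed config; $config_file: path or undef.
sub root_dir {
    my ( $config, $config_file ) = @_;

    # Rule 1: an explicit working_dir entry wins.
    return $config->{working_dir}
        if defined $config->{working_dir} && length $config->{working_dir};

    # Rule 2: otherwise use the directory containing the config file.
    return dirname( File::Spec->rel2abs($config_file) ) if defined $config_file;

    # Rule 3: no config file at all, so use the current directory.
    return getcwd();
}

# Resolve one path against the root; absolute paths are returned unchanged.
sub resolve_path {
    my ( $path, $root ) = @_;
    return File::Spec->rel2abs( $path, $root );
}

my $root = root_dir( {}, '/home/user/jobs/job.conf.yml' );
print resolve_path( 'result_gff', $root ), "\n";
```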

The input section

The input section specifies the type of input data and how the index should be created.

The basic layout is:

---
input:
- ... # details of index 1
- ... # details of index 2

Each index element shows up as an argument in the task function:

run_job(...
  task => sub {
    my ( $c, $result_prefix, $element_index_1, $element_index_2, ... ) = @_;
  }
  ...
)

The number of indices you can use is determined by the mode. The most basic mode is Consecutive and it takes one index and iterates through every element.

Index types

General
List
FileList
Range
ListFromFile
Dummy

Running mode

Bio::Grid::Run::SGE can run in different iteration modes:

Consecutive
AvsB
AllvsAll
AllvsAllNoRep

Example input section showing the options of several index types:

---
input:
- format: General
  #files, list and elements are synonyms
  files:
  - ../03_clean_evidence/result/merged.fa.clean
  chunk_size: 30
  sep: ^>
  sep_remove: 1
  sep_pos: '^' # or '$'
  ignore_first_sep: 1

- format: List
  list: [ 'a', 'b', 'c' ]
  
- format: FileList
  files: [ 'filea', 'fileb', 'filec' ]

- format: Range
  list: [ 'from', 'to' ]

RESERVED CONFIGURATION OPTIONS

Example configuration:

'stdout_dir' => '/WORKING_DIR/xml_munge1.tmp/out',
'test'       => '1',
'no_prompt'  => undef,
'input'      => [
  {
    'elements' => [
      '../../2013-10-13_string_b2g_blast/cafa_b2g_blastSTRING_9606_protein.sequences.result/cafa_b2g_blastSTRING_*_protein.sequences.*.blast.gz'
    ],
    'format'   => 'FileList',
    'idx_file' => '/WORKING_DIR/idx/xml_munge1.0.idx'
  }
],
'mode'          => 'Consecutive',
'range'         => [ '1', '1' ],
'submit_bin'    => 'qsub',
'submit_params' => [],
'args'          => [],
'working_dir'   => '/WORKING_DIR/test',
'num_comb'      => 564,
'log_dir'       => '/WORKING_DIR/xml_munge1.tmp/log',
'stderr_dir'    => '/WORKING_DIR/xml_munge1.tmp/err',
'tmp_dir'       => '/WORKING_DIR/xml_munge1.tmp',
'smtp_server'   => 'net.wur.nl',
'job_name'      => 'xml_munge1',
'extra'         => { 'map' => '../split_test.map.json.gz' },
'mail'          => 'joachim.bargsten@wur.nl',
'script_dir'    => '/WORKING_DIR/bin',
'idx_dir'       => '/WORKING_DIR/idx',
'job_cmd' =>
  'qsub -t 1-1 -S perl -N xml_munge1 -e /WORKING_DIR/xml_munge1.tmp/err -o /WORKING_DIR/xml_munge1.tmp/out /WORKING_DIR/xml_munge1.tmp/env.xml_munge1.pl WORKING_DIR/bin/cl_xml_munge.pl --worker /WORKING_DIR/xml_munge1.tmp/xml_munge1.config.dat',
'job_id' => '325541.1',
'cmd'    => [ '/WORKING_DIR/bin/cl_xml_munge.pl' ],
'worker_config_file' =>
  '/WORKING_DIR/xml_munge1.tmp/xml_munge1.config.dat',
'prefix_output_dirs' => '1',
'perl_bin'           => '/home/cafa/perl5/perlbrew/perls/perl-5.16.3/bin/perl',
'result_dir'         => '/WORKING_DIR/xml_munge1.result',
'part_size'          => 1,
'num_parts'          => 564

Here is a list of reserved configuration options:

$c = {
  cmd => ...,
  script_dir => ...,
  no_post_task => ...,
  tmp_dir => ...,
  stderr_dir => ...,
  stdout_dir => ...,
  result_dir => ...,
  log_dir => ...,
  idx_dir => ...,
  test => ...,
  mail => ...,
  smtp_server => ...,
  no_prompt => ...,
  lib => ...,
  input => ...,
  extra => ...,
  num_parts => ...,
  combinations_per_task => ...,
  job_name => ...,
  job_id => ...,
  mode => ...,
  worker_config_file => ...,
  worker_env_script => ...,
  submit_bin => ...,
  submit_params => ...,
  perl_bin => ...,
  working_dir => ...,
  iterator => ...,

  args => ...,
};

The args attribute is special: the main executable is normally hard-coded in the cl_* script, while its arguments change from one configuration to the next. Bio::Grid::Run::SGE::Master therefore provides the convenience attribute $c->{args}.
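
A small sketch of how a cl_* script might combine a hard-coded executable with $c->{args} follows. The $c hash is a hand-built stand-in for the real configuration, and the executable name my_tool and the trailing input file name are hypothetical; the args values mirror the job-specific config example.

```perl
use strict;
use warnings;

# Stand-in for the configuration hash; args mirrors the job config example.
my $c = { args => [ '-a', 10, '-b', 'no' ] };

# The executable is fixed in the script; only the arguments come from the config.
# An absent args entry is treated as an empty list (an assumption of this sketch).
my @cmd = ( 'my_tool', @{ $c->{args} || [] }, 'input.part' );
print join( ' ', @cmd ), "\n";
```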