NAME
Bio::Grid::Run::SGE - Distribute (biological) analyses on the local SGE grid
SYNOPSIS
You want to distribute computational tasks on the cluster nodes. A simple example would be to calculate the reverse complement of 10,000,000,000,000,000,000 sequences in a FASTA file in a distributed fashion.
First, create a Perl script cl_reverse_complement.pl that runs the analysis in the Bio::Grid::Run::SGE environment.
use warnings;
use strict;

use Bio::Grid::Run::SGE;
use Bio::Gonzales::Seq::IO qw/faslurp faspew/;

run_job(
  {
    task => sub {
      my ( $c, $result_file_name_prefix, $input ) = @_;

      # we are using the "General" index, so $input is a filename
      # containing some sequences

      # read in the sequences
      my @sequences = faslurp($input);

      # iterate over them and calculate the reverse complement
      for my $seq (@sequences) {
        $seq->revcom;
      }

      # finally write the sequences to a results file specific for the current job
      faspew( $result_file_name_prefix . ".fa", @sequences );

      # return 1 for success (0/undef for error)
      return 1;
    },
  }
);

exit;
Second, create a config file conf.yml (YAML format) to specify file names and pipeline parameters.
---
input:
  # use the Bio::Grid::Run::SGE::Index::General index
  # to index the sequence files
  - format: General
    # an array of one or more sequence files
    files: [ 'sequences.fa' ]
    # fasta headers start with '>'
    sep: '^>'
job_name: reverse_complement
# iterate consecutively through all sequences
# and call cl_reverse_complement.pl on it
mode: Consecutive
Third, with this basic configuration, you can run the reverse complement distributed on the cluster by invoking
perl cl_reverse_complement.pl conf.yml
There are many more options, indices and modes available; see DESCRIPTION for more info.
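For readers unfamiliar with the operation used in the example, the following is a minimal plain-Perl sketch of what "reverse complement" means for a DNA string. It is an illustration only and not how Bio::Gonzales implements revcom; the helper name reverse_complement is hypothetical, and the sketch assumes an alphabet of A, C, G and T (upper or lower case).

```perl
use strict;
use warnings;

# Hypothetical helper for illustration (not part of Bio::Gonzales):
# reverse the sequence, then swap every base with its complement
# (A<->T, C<->G), preserving case.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

print reverse_complement("AAGCT"), "\n";    # prints AGCTT
```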
INSTALLATION
1. Install Bio::Grid::Run::SGE from CPAN
2. Create a global config file $HOME/.bio-grid-run-sge.conf
3. Follow the steps in "SYNOPSIS"
DESCRIPTION
The general flow starts with running the cluster script. The script defines an index and an iterator. Indices describe how to split the data into chunks, whereas iterators describe in what order these chunks are fed to the cluster script.
Once the script is started, pre tasks are run and the index is set up. You have to confirm the setup to start the job on the cluster. Bio::Grid::Run::SGE then submits the cluster script as an array job to the cluster.
Output is stored in the result folder; intermediate files are stored in the temporary folder. The temporary folder contains scripts to rerun failed jobs and to update the job status, standard error and standard output, files containing data chunks, and additional log information.
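The split between index and iterator described above can be pictured with a small plain-Perl sketch. It is not the implementation used by Bio::Grid::Run::SGE: the helper chunk_records is hypothetical, and the sketch assumes that an index cuts the input at every line matching a separator pattern and groups a fixed number of records into each chunk, while a consecutive iterator hands out the chunks one after another.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Grid::Run::SGE: split text into
# records that each start at a line matching $sep, then group the records
# into chunks of at most $chunk_size records.
sub chunk_records {
    my ( $text, $sep, $chunk_size ) = @_;

    my @records;
    for my $line ( split /^/m, $text ) {
        # a separator line (or the very first line) opens a new record
        push @records, '' if $line =~ $sep || !@records;
        $records[-1] .= $line;
    }

    my @chunks;
    while ( my @chunk = splice( @records, 0, $chunk_size ) ) {
        push @chunks, [@chunk];
    }
    return @chunks;
}

my $fasta  = ">s1\nACGT\n>s2\nGGCC\nTTAA\n>s3\nAT\n";
my @chunks = chunk_records( $fasta, qr/^>/, 2 );

# consecutive order: hand the chunks out one after another
for my $i ( 0 .. $#chunks ) {
    printf "job %d gets %d record(s)\n", $i + 1, scalar @{ $chunks[$i] };
}
```

With three records and a chunk size of 2, the sketch produces two chunks, so two jobs would be needed.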
Bio::Grid::Run::SGE SCRIPT FILE STRUCTURE
Run stuff before the job is started (pre_task)
Run the job (task)
Input data
Run stuff after the job finished (post_task)
INPUT INDICES
ITERATION MODES
CONFIGURATION FILES
input section
---
input:
  - format: General
    #files, list and elements are synonyms
    files:
      - ../03_clean_evidence/result/merged.fa.clean
    chunk_size: 30
    sep: ^>
    sep_remove: 1
    sep_pos: '^'/'$'
    ignore_first_sep: 1
  - format: List
    list: [ 'a', 'b', 'c' ]
  - format: FileList
    files: [ 'filea', 'fileb', 'filec' ]
  - format: Range
    list: [ 'from', 'to' ]
job_name: NAME
mode: Consecutive/AvsB/AllvsAll/AllvsAllNoRep
args: [ '-a', 10, '-b','no' ]
test: 2
no_prompt: 1
parts: 3000
# or
combinations_per_job: 300
result_dir: result_gff
working_dir:
stderr_dir:
stdout_dir:
log_dir: dir
tmp_dir: dir
idx_dir: dir
prefix_output_dirs:
INCLUDED 3RD PARTY SOFTWARE
To show the running time of jobs, the distribution script was used. The script is distributed under the GPL, so honor that if you use this package. I personally have to thank Tim Ellis for creating such a nice script.
SEE ALSO
Bio::Gonzales, Bio::Grid::Run::SGE::Util
AUTHOR
jw bargsten, <joachim.bargsten at wur.nl>