NAME
Bio::Grid::Run::SGE - Distribute (biological) analyses on the local SGE grid
SYNOPSIS
Bio::Grid::Run::SGE lets you distribute a computational task on cluster nodes.
Imagine you want to run a pipeline (a concatenation of several tasks) in parallel. This is usually no problem, because plenty of frameworks can do it. However, if ONE task of your pipeline is so big that it must be split into multiple subtasks, Bio::Grid::Run::SGE may be extremely useful.
A simple example would be to calculate the reverse complement of 10,000,000,000,000,000,000 sequences in a FASTA file in a distributed fashion.
To run it with Bio::Grid::Run::SGE, you need
- a. A cluster script that executes the task (or calls a 2nd script that executes the task)
- b. A job configuration file in YAML format
- c. Input data
On the command line, this looks like:
$ perl ./cl_script_with_task.pl job_configuration.conf.yml
To continue with the example of the reverse complement (don't worry, the example does not use 10,000,000,000,000,000,000 sequences, but only the human CDS sequences):
First, create a Perl script, cl_reverse_complement.pl, that executes the analysis in the Bio::Grid::Run::SGE environment.
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010;
use Bio::Grid::Run::SGE;
use Bio::Gonzales::Seq::IO qw/faslurp faspew/;

run_job(
    task => sub {
        my ( $c, $result_file_name_prefix, $input ) = @_;

        # we are using the "General" index, so $input is a filename
        # containing some sequences

        # read in the sequences
        my @sequences = faslurp($input);

        # iterate over them and calculate the reverse complement
        for my $seq (@sequences) {
            $seq->revcom;
        }

        # finally write the sequences to a results file specific for the current job
        faspew( $result_file_name_prefix . ".fa", @sequences );

        # return 1 for success (0/undef for error)
        return 1;
    }
);

1;
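To make the per-sequence work of the task concrete, here is a minimal sketch of a reverse complement in plain Perl. It is only an illustration and makes no claim about how Bio::Gonzales implements revcom; the function name reverse_complement is hypothetical, and only the bases A, C, G and T (upper or lower case) are handled.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the per-sequence work done in the task:
# reverse the string, then swap each base with its complement.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # complement each base in place
    return $rc;
}

print reverse_complement("ATGCCG"), "\n";    # prints CGGCAT
```

In the cluster script, this step is applied to every sequence of the chunk passed in as $input.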
Second, download sequences and create a config file rc_human.conf.yml (YAML format) to specify file names and pipeline parameters.
The example uses human CDS sequences, which you have to download first, e.g. from Ensembl:
$ wget ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/cds/Homo_sapiens.GRCh37.75.cds.all.fa.gz
Currently, Bio::Grid::Run::SGE does not have an index that supports on-the-fly decompression of input data, so you have to decompress the file yourself:
$ gunzip Homo_sapiens.GRCh37.75.cds.all.fa.gz
Now you can create the config file rc_human.conf.yml (YAML format) and paste the following text into it:
---
input:
    # use the Bio::Grid::Run::SGE::Index::General index
    # to index the sequence files
  - format: General
    # supply 100 sequences in one chunk
    # ($input in the cluster script contains 100 sequences)
    chunk_size: 100
    # an array of one or more sequence files
    files: [ 'Homo_sapiens.GRCh37.75.cds.all.fa' ]
    # fasta headers start with '>'
    sep: '^>'
job_name: reverse_complement
# iterate consecutively through all chunks
# and call cl_reverse_complement.pl on each of them
mode: Consecutive
Third, with this basic configuration, you can run the reverse complement distributed on the cluster by invoking
perl cl_reverse_complement.pl rc_human.conf.yml
The results will be in the reverse_complement.result folder.
Many more options, indices and modes are available; see "DESCRIPTION" for more info.
INSTALLATION
- 1. Install Bio::Grid::Run::SGE from CPAN
The tests run during installation expect the qsub and qstat executables and might fail if you don't have them. In that case, just skip the tests.
- 2. Try the stuff in "SYNOPSIS"
DESCRIPTION
Control flow in Bio::Grid::Run::SGE
The general flow starts with running the cluster script. The script defines an index and an iterator. Indices describe how to split the data into chunks, whereas iterators describe in what order these chunks are fed to the cluster script.
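The split between index and iterator can be illustrated with a small self-contained sketch. It is not the module's implementation: chunk_records is a hypothetical helper that mimics what a separator-based index with a chunk size does conceptually, and the loop at the end mimics a consecutive iteration order.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Grid::Run::SGE: split text into
# records at every line matching $sep, then group the records into
# chunks of at most $chunk_size records each.
sub chunk_records {
    my ( $text, $sep, $chunk_size ) = @_;

    my @records;
    for my $line ( split /^/m, $text ) {
        # a line matching the separator starts a new record
        push @records, '' if $line =~ /$sep/ || !@records;
        $records[-1] .= $line;
    }

    my @chunks;
    while ( my @part = splice @records, 0, $chunk_size ) {
        push @chunks, join '', @part;
    }
    return @chunks;
}

my $fasta  = ">a\nACGT\n>b\nGGCC\nTT\n>c\nAAAA\n";
my @chunks = chunk_records( $fasta, qr/^>/, 2 );

# "consecutive" order: hand the chunks to the task one after another
for my $i ( 0 .. $#chunks ) {
    print "chunk ", $i + 1, ":\n$chunks[$i]";
}
```

With three records and a chunk size of 2, the first chunk holds two sequences and the second holds the remaining one.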
Once the script is started, pre tasks might be run and the index is set up. You have to confirm the setup to start the job on the cluster. Bio::Grid::Run::SGE then submits the cluster script as an array job to the cluster. After the job is finished, post tasks, if specified, are run.
Output is stored in the result folder; intermediate files are stored in the temporary folder. The temporary folder contains the log, scripts to rerun failed jobs and to update the job status, standard error and output, files containing data chunks, and additional log information.
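In a Grid Engine array job, every task runs the same script and learns its own position from the SGE_TASK_ID environment variable. The following sketch shows one way chunks could be spread over array tasks. How Bio::Grid::Run::SGE actually assigns chunks is not described here, so the helper chunks_for_task and the numbers (10 chunks, 4 tasks) are assumptions for illustration only.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: which chunk numbers (1-based) does array task
# $task_id handle when $num_chunks chunks are spread over $num_tasks tasks?
sub chunks_for_task {
    my ( $task_id, $num_chunks, $num_tasks ) = @_;

    # chunks per task, rounded up
    my $per_task = int( ( $num_chunks + $num_tasks - 1 ) / $num_tasks );
    my $first    = ( $task_id - 1 ) * $per_task + 1;
    my $last     = $first + $per_task - 1;
    $last = $num_chunks if $last > $num_chunks;

    return $first > $num_chunks ? () : ( $first .. $last );
}

# Grid Engine sets SGE_TASK_ID for each task of an array job;
# fall back to task 1 when the script is run outside the grid.
my $task_id = ( $ENV{SGE_TASK_ID} // '' ) =~ /^(\d+)$/ ? $1 : 1;
my @mine    = chunks_for_task( $task_id, 10, 4 );
say "task $task_id handles chunks: @mine";
```

Because each task derives its work from its own id, failed tasks can be rerun individually without touching the others.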
DOCUMENTATION CONTENTS
- Writing cluster scripts
- Writing configuration files
- Using indices
- Using iteration modes
- Job logging
- Job state notifications
- Running other (e.g. Python) scripts
INCLUDED 3RD PARTY SOFTWARE
To show the running time of jobs, the "distribution" script was used. The script is distributed under the GPL, so honor that if you use this package. I personally have to thank Tim Ellis for creating such a nice script.
SEE ALSO
Bio::Gonzales, Bio::Grid::Run::SGE::Util
AUTHOR
jw bargsten, <joachim.bargsten at wur.nl>