NAME

Bio::Grid::Run::SGE - Distribute (biological) analyses on the local SGE grid

SYNOPSIS

Bio::Grid::Run::SGE lets you distribute a computational task on cluster nodes.

Imagine you want to run a pipeline (a concatenation of several tasks) in parallel. This is usually no problem; there are plenty of frameworks for it. However, if ONE task of your pipeline is so big that it has to be split up into multiple subtasks, then Bio::Grid::Run::SGE may be extremely useful.

A simple example would be to calculate the reverse complement of 10,000,000,000,000,000,000 sequences in a FASTA file in a distributed fashion.

To run it with Bio::Grid::Run::SGE, you need

a. A cluster script that executes the task (or calls a second script that executes the task)
b. A job configuration file in YAML format
c. Input data

On the command line, this looks like:

$ perl ./cl_script_with_task.pl job_configuration.conf.yml

To continue with the example of the reverse complement (don't worry, the example does not use 10,000,000,000,000,000,000 sequences, but only the human CDS sequences):

First, create a perl script cl_reverse_complement.pl that executes the analysis in the Bio::Grid::Run::SGE environment.

#!/usr/bin/env perl

use warnings;
use strict;
use 5.010;

use Bio::Grid::Run::SGE;
use Bio::Gonzales::Seq::IO qw/faslurp faspew/;

job->run({
  task => sub {
    my ( $c, $result_file_name_prefix, $input ) = @_;

    # we are using the "General" index, so $input is a filename
    # containing some sequences

    # read in the sequences
    my @sequences = faslurp($input);

    # iterate over the sequences and compute the
    # reverse complement of each one (in place)
    for my $seq (@sequences) {
      $seq->revcom;
    }
    # finally write the sequences to a results file specific for the current job
    faspew( $result_file_name_prefix . ".fa", @sequences );

    # return 1 for success (0/undef for error)
    return 1;
  }
});

1;

Second, download sequences and create a config file rc_human.conf.yml (YAML format) to specify file names and pipeline parameters.

The example uses human CDS sequences; you can download them from, e.g., Ensembl:

$ wget ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/cds/Homo_sapiens.GRCh37.75.cds.all.fa.gz

Currently, Bio::Grid::Run::SGE does not have an index that supports on-the-fly decompression of input data, so you have to decompress it yourself:

$ gunzip Homo_sapiens.GRCh37.75.cds.all.fa.gz

Now create the config file rc_human.conf.yml and paste the following text into it:

---
input:
# use the Bio::Grid::Run::SGE::Index::General index 
# to index the sequence files
- format: General
  # supply 100 sequences in one chunk
  # ($input in the cluster script contains 100 sequences)
  chunk_size: 100
  # an array of one or more sequence files
  files: [ 'Homo_sapiens.GRCh37.75.cds.all.fa' ]
  # fasta headers start with '>'
  sep: '^>'
job_name: reverse_complement

# iterate consecutively over all chunks and run
# the task from cl_reverse_complement.pl on each one
mode: Consecutive

Third, with this basic configuration, you can run the reverse complement calculation on the cluster by invoking

$ perl cl_reverse_complement.pl rc_human.conf.yml

The results will be in reverse_complement.result.
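
Each array job writes its own output file, so the result folder will contain one FASTA file per chunk. To merge them into a single file (assuming the .fa suffix used in the task above):

$ cat reverse_complement.result/*.fa > reverse_complement.all.fa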

There are a lot more options, indices and modes available, see "DESCRIPTION" for more info.

INSTALLATION

1. Install Bio::Grid::Run::SGE from CPAN

The tests run during installation expect the qsub and qstat executables and might fail if those are not available. In that case, just skip the tests.
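
If you use cpanm, installation is a one-liner (add --notest to skip the tests, e.g. if qsub and qstat are missing):

$ cpanm Bio::Grid::Run::SGE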

2. Try the example in "SYNOPSIS"

DESCRIPTION

Control flow in Bio::Grid::Run::SGE

The general flow starts at running the cluster script. The script defines an index and an iterator. Indices describe how to split the data into chunks, whereas iterators describe in what order these chunks get fed to the cluster script.
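
For illustration, here is what this means for the task callback (a sketch; the single-input signature follows the synopsis above, while the two-input form for AvsB mode is an assumption based on the mode descriptions below):

# Consecutive mode with one index: the task is called
# once per chunk, as in the synopsis
task => sub {
  my ( $c, $result_prefix, $chunk ) = @_;
  # ... process one chunk ...
  return 1;
}

# AvsB mode with two indices: chunks of input A are paired
# with chunks of input B, so the task presumably receives
# two input elements (assumption)
task => sub {
  my ( $c, $result_prefix, $chunk_a, $chunk_b ) = @_;
  # ... process the pair ...
  return 1;
}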

Once the script is started, pre tasks might be run and the index is set up. You have to confirm the setup to start the job on the cluster. Bio::Grid::Run::SGE then submits the cluster script as an array job to the cluster. After the job has finished, post tasks, if specified, are run.
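
In the cluster script, this could look roughly as follows (a sketch; the pre_task and post_task key names are assumptions, only task appears in the synopsis):

job->run({
  # runs once, before the array job is submitted (assumed key)
  pre_task => sub {
    my ($c) = @_;
    return 1;
  },
  # runs on the nodes, once per data chunk
  task => sub {
    my ( $c, $result_prefix, $input ) = @_;
    return 1;
  },
  # runs once, after all array jobs have finished (assumed key)
  post_task => sub {
    my ($c) = @_;
    return 1;
  },
});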

Output is stored in the result folder; intermediate files are stored in the temporary folder. The temporary folder contains the main log, scripts to rerun failed jobs or update the job status, the standard output and error streams, files containing the data chunks, and additional log information.

Logical parts

A job consists of four logical parts: the cluster script with the task code, the configuration file, the index that splits the input data into chunks and the iteration mode that determines how the chunks are fed to the task.

DOCUMENTATION CONTENTS

Writing cluster scripts
Writing configuration files
Using indices
Using iteration modes
Consecutive
AvsB
AllvsAll
AllvsAllNoRep
Job logging
Job state notifications
Running other (e.g. Python) scripts (see the sketch after this list)
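
The last point deserves a minimal sketch: a task is just a Perl callback, so it can delegate the real work to any external interpreter. Here, system() hands the chunk to a hypothetical Python script (my_analysis.py is made up):

task => sub {
  my ( $c, $result_prefix, $input ) = @_;

  # let the (hypothetical) Python script do the work and write
  # its output next to the other results
  my $exit = system( 'python', 'my_analysis.py', $input, "$result_prefix.out" );

  # return 1 for success, 0/undef for error
  return $exit == 0 ? 1 : 0;
}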

INCLUDED 3RD PARTY SOFTWARE

To show the running time of jobs, the script "distribution" (by Tim Ellis) is included. The script is distributed under the GPL, so honor that license if you use this package. I personally have to thank Tim Ellis for creating such a nice script.

SEE ALSO

Bio::Gonzales, Bio::Grid::Run::SGE::Util

AUTHOR

jw bargsten, <joachim.bargsten at wur.nl>