NAME

Bio::Gonzales::Project::Functions - organize your computational experiments

SYNOPSIS

Inspired by "A Quick Guide to Organizing Computational Biology Projects", this module makes it easy to organise computational biology projects.

$ gonzp init human_genome
$ cd human_genome/analysis

$ gonzp analysis genome_assembly
$ cd genome_assembly

# set up scripts, Makefile, etc.
# ...

$ make human_genome_assembly


$ gonzp analysis genome_annotation # finds the project directory automatically
$ cd ../genome_annotation

# set up scripts, Makefile, etc.
# ...

$ make human_genome_annotation

DESCRIPTION

Project Layout

Create it with gonzp init <project_name>

A project consists of a root directory that contains everything: the paper draft, analyses, 3rd-party documentation (and perhaps literature), scripts, etc. The whole system is based on Makefiles (to start the different analysis steps) and Perl modules (surprise, surprise!).

The documentation goes into the README file, in whatever format (plain text, markdown, textile, ...) you prefer.

Thus, the basic layout of an example project is:

example/Makefile  (a Makefile to start single analyses)
example/README    (overview documentation of the computational experiment)

example/analysis/ (all analyses go in here)

example/data/     (3rd-party data, such as the uniprot database or 
                   experimental results, common to the whole computational
                   experiment go in here)

example/paper/    (the paper draft goes in here)

example/docs/     (3rd-party documentation)

example/lib/      (if some scripts or analyses have a lot in common,
                   creating a module/library might be helpful)

analysis

Create it with gonzp analysis <analysis_name>

The analysis directory contains all analyses that have been done, one directory per analysis. The layout in example/analysis is therefore:

./important_computational_experiment/Makefile     (the Makefile to start single analysis steps)
./important_computational_experiment/av           (the analysis version)
./important_computational_experiment/README       (some analysis-specific documentation)
./important_computational_experiment/gonz.conf.yml (configuration stuff, e.g. file locations or parameters)
./important_computational_experiment/2014-01-28/  (the analysis directory derived from the version stored in "av")
./important_computational_experiment/data/        (analysis-specific data)
./important_computational_experiment/playground/  (here you can try stuff)
./important_computational_experiment/bin/         (a directory to store the scripts)

the analysis version

The analysis version is just a single string and defaults to the day the analysis was created. The av file contains, for example:

$ cat important_computational_experiment/av
2014-01-28

Change it to whatever you want. A common use case is changing input data or parameters without clobbering the previous results: change the analysis version to a different date and rerun the whole analysis.
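To make the idea concrete, here is a minimal sketch of how a result path can be derived from the av file. It is not the module's implementation: the helpers read_analysis_version and versioned_path are hypothetical names invented for this illustration, and the demo runs in a temporary directory.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010;

use File::Spec;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical helper, NOT part of Bio::Gonzales::Project::Functions:
# read the analysis version from the "av" file in $analysis_dir.
sub read_analysis_version {
  my ($analysis_dir) = @_;
  my $av_file = File::Spec->catfile( $analysis_dir, 'av' );
  open my $fh, '<', $av_file or die "Can't open $av_file: $!";
  my $version = <$fh>;
  close $fh;
  die "$av_file is empty" unless defined $version;
  $version =~ s/\s+\z//;    # strip the trailing newline
  return $version;
}

# Hypothetical helper: build a result path inside the version directory,
# creating that directory if it does not exist yet.
sub versioned_path {
  my ( $analysis_dir, $filename ) = @_;
  my $dir = File::Spec->catdir( $analysis_dir, read_analysis_version($analysis_dir) );
  make_path($dir);
  return File::Spec->catfile( $dir, $filename );
}

# Demo in a throwaway directory: changing "av" redirects the output,
# so results of the earlier version are left untouched.
my $analysis_dir = tempdir( CLEANUP => 1 );
for my $version ( '2014-01-28', '2014-02-03' ) {
  open my $fh, '>', File::Spec->catfile( $analysis_dir, 'av' ) or die "Can't write av: $!";
  say $fh $version;
  close $fh;
  say versioned_path( $analysis_dir, 'avg_num_leaves.tsv' );
}
```

Each version gets its own directory, which is why rerunning the analysis after editing av does not overwrite earlier results.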

The analysis version is an integral part of Bio::Gonzales::Project::Functions and is therefore accessible in several ways:

In the Makefile, as the $(AV) variable.

Via an exported function of Bio::Gonzales::Project::Functions, nfi($filename)

For example, suppose you want to calculate the average number of leaves for 4 plant accessions. With 3 replicates each, that gives 12 records:

Input data data/leaves.txt (tab-separated):

accession num_leaves
ACC_001 3
ACC_001 4
ACC_001 6
ACC_002 8
ACC_002 14
ACC_002 12
ACC_003 18
ACC_003 10
ACC_003 12
ACC_004 10
ACC_004 4
ACC_004 7

Script bin/calc_number_of_avg_leaves.pl

   #!/usr/bin/env perl
   # created on 2014-01-28
   
   use warnings;
   use strict;
   use 5.010;
   
   use Bio::Gonzales::Project::Functions;
   use List::Util qw(sum);
   
   # read in some raw data
   open my $fh, '<', 'data/leaves.txt' or die "Can't open filehandle: $!";
   
   my %num_leaves;
   
   <$fh>;    # get rid of the header
   while ( my $line = <$fh> ) {
     chomp $line;
     my ( $acc, $num_leaves ) = split /\t/, $line;
   
     push @{ $num_leaves{$acc} }, $num_leaves;
   }
   
   close $fh;
   
   # nfi = new file in the current analysis version directory
   # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
   my $result_file = nfi("avg_num_leaves.tsv");
   
   # open the result file
   open my $result_fh, '>', $result_file or die "Can't open filehandle: $!";
   
   # calculate the result and write it
   while ( my ( $acc, $leaves ) = each %num_leaves ) {
     my $sum   = sum @$leaves;
     my $count = scalar @$leaves;
     my $avg   = $sum / $count;
   
     say $result_fh join( "\t", $acc, $avg );
   }
   
   close $result_fh;

Via an exported variable of Bio::Gonzales::Project::Functions, $ANALYSIS_VERSION

The script changes only slightly; the changed lines are shown here:

original:

use Bio::Gonzales::Project::Functions;

...

# nfi = new file in the current analysis version directory
# here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
my $result_file = nfi("avg_num_leaves.tsv");

changed:

use Bio::Gonzales::Project::Functions qw(:DEFAULT $ANALYSIS_VERSION);  # CHANGED

...

# here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
my $result_file = "$ANALYSIS_VERSION/avg_num_leaves.tsv";  # CHANGED

Configuration

The configuration is stored in gonz.conf.yml and is accessible from the commandline and via Perl functions. The configuration format is YAML, so you can freely store configuration in various data structures, such as lists or dictionaries.

Access via commandline

Commandline access is intended for use in the Makefile. The commandline script is called gonzconf. See

gonzconf --help

for help. gonzconf looks for gonz.conf.yml and extracts parts of the configuration. Example:

gonz.conf.yml

---
genotypes:
  - genotype_1
  - genotype_2
  - genotype_3

Make target:

GENOTYPES=$(shell gonzconf --flat genotypes)
analysis:
  for g in $(GENOTYPES); do \
    echo "analysing $$g"; \
  done

Access in perl

In perl scripts the configuration can be accessed via the gonzconf function.

my $config = gonzconf()

Calling the function without arguments returns the complete configuration. It can be accessed as a normal Perl array or hash reference (depending on the configuration).

Example:

#!/usr/bin/env perl

use warnings;
use strict;
use 5.010;

use Bio::Gonzales::Project::Functions;

my $config = gonzconf();
my @genotypes = @{$config->{genotypes}};

for my $genotype (@genotypes) {
  say "analysing genotype $genotype";
}

my $config_entry = gonzconf($entry)

gonzconf can take one argument to access top-level entries directly. This assumes that the top level of the configuration is organised as a hash/dictionary.

Example:

#!/usr/bin/env perl

use warnings;
use strict;
use 5.010;

use Bio::Gonzales::Project::Functions;

my @genotypes = @{gonzconf("genotypes")};

for my $genotype (@genotypes) {
  say "analysing genotype $genotype";
}
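
The two access styles can be illustrated without the module. The sketch below is only an approximation under stated assumptions: $config is a hand-written stand-in for the parsed gonz.conf.yml, the "parameters" entry is invented for illustration, and config_entry is a hypothetical helper that mimics the one-argument form. Dying on a missing entry is a choice of this sketch, not documented behaviour of gonzconf.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010;

# Stand-in for the parsed contents of gonz.conf.yml (hypothetical data;
# the "parameters" entry does not appear in the example configuration).
my $config = {
  genotypes  => [qw(genotype_1 genotype_2 genotype_3)],
  parameters => { min_reads => 10 },
};

# Hypothetical helper mimicking the one-argument form: it only works
# if the top level of the configuration is a hash.
sub config_entry {
  my ( $conf, $entry ) = @_;
  die "configuration is not a hash" unless ref $conf eq 'HASH';
  die "no such entry: $entry" unless exists $conf->{$entry};
  return $conf->{$entry};
}

# a YAML list becomes an array reference ...
say join ", ", @{ config_entry( $config, 'genotypes' ) };

# ... and a YAML dictionary becomes a hash reference
say config_entry( $config, 'parameters' )->{min_reads};
```

YAML lists map to array references and YAML dictionaries to hash references, so nested entries are reached with ordinary Perl dereferencing.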

Logging

Bio::Gonzales::Project::Functions comes with logging included. The log is stored in $ANALYSIS_VERSION/gonzlog, so every analysis version has its own log file. Five log levels are available: debug, info, warn, error and fatal.

Access via commandline

Run

gonzlog <namespace> <message>

to log something. The log level is hardcoded to "info".

Access via perl

Bio::Gonzales::Project::Functions exports the function gonzlog by default. To log something, run

gonzlog->info("message");

# or

my $log = gonzlog();
$log->info("message");

The namespace is the filename of the invoking script.

SEE ALSO

AUTHOR

jw bargsten, <joachim.bargsten at wur.nl>