NAME

Bio::Grep - Perl extension for searching in Fasta files

VERSION

This document describes Bio::Grep version 0.5.0

SYNOPSIS

 use Bio::Grep;
 
 my $search_obj = Bio::Grep->new('Vmatch');	
 
 # $sbe is now a reference to the back-end
 my $sbe = $search_obj->backend;

 # define the location of the suffix arrays
 $sbe->settings->datapath('data');
 
 mkdir($sbe->settings->datapath);	
 
 # now generate a suffix array. you have to do this only once.
 $sbe->generate_database_out_of_fastafile('t/Test.fasta', 'Description for the test Fastafile');
 
 # search in this suffix array
 $sbe->settings->database('Test.fasta');
 
 # search for the reverse complement and allow 2 mismatches
 $sbe->settings->query('UGAACAGAAAG');
 $sbe->settings->reverse_complement(1);
 $sbe->settings->mismatches(2);

 # or you can use a Fasta file with queries
 # $sbe->settings->query_file('Oligos.fasta');

 # $sbe->search();

 # Alternatively, you can specify the settings in the search call.
 # This also resets everything except the paths and the database
 # (because it is likely that they don't change when search is called
 # multiple times)

 $sbe->search( { query  =>  'UGAACAGAAAG',
                 reverse_complement => 1,
                 mismatches         => 2,
                });  
 
 my @ids;

 # output some information
 while ( my $res = $sbe->next_res ) {
    print $res->sequence->id . "\n";
    print $res->alignment_string() . "\n\n";
    push @ids, $res->sequence_id;
 }
 
 # get the gene sequences of all matches as Bio::SeqIO object.
 # (to generate a Fasta file for example)
 my $seqio = $sbe->get_sequences(\@ids);

DESCRIPTION

Bio::Grep is a collection of Perl modules for searching in Fasta files. It supports different back-ends, most importantly some (enhanced) suffix array implementations. Currently, there is no suffix array tool that works in all scenarios (for example whole genome, protein and RNA data). Bio::Grep provides a common API to the most popular tools, so you can easily switch between tools or combine them.

METHODS

new($backend)

This function constructs a Bio::Grep object. Available back-ends are Vmatch, Agrep, GUUGle and Hypa. Vmatch is the default.

The constructor sets the temporary path to File::Spec->tmpdir().

my $search_obj = Bio::Grep->new('Agrep');

backend()

Get/set the back-end. This is an object that uses Bio::Grep::Backends::BackendI as its base class. See Bio::Grep::Backends::BackendI, Bio::Grep::Backends::Vmatch, Bio::Grep::Backends::Agrep, Bio::Grep::Backends::GUUGle and Bio::Grep::Backends::Hypa.

FEATURES

  • We support most of the features of the back-ends. If a particular feature is not supported, then we probably did not need it until now. But in general it should be easy to integrate. For a complete list of supported features, see Bio::Grep::Container::SearchSettings, for an overview see "FEATURE COMPARISON".

  • This module should be suitable for large datasets. The back-end output is piped to a temporary file and the parser only stores the current hit in memory.

  • Bio::Grep has a nice interface for search result filters. See "FILTERS".

  • Bio::Grep was in particular designed for web services and therefore checks the settings carefully before calling back-ends. See "SECURITY".
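The streaming approach mentioned in the list above (back-end output goes to a temporary file and only the current hit is held in memory) can be sketched in plain Perl. This is only an illustration of the pattern, not Bio::Grep's parser: the tab-separated hit format and the name make_hit_iterator are assumptions made up for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for back-end output: one hit per line as "sequence_id<TAB>position".
# This format is invented for the example.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print {$fh} "seq1\t10\nseq2\t42\n";
close $fh or die "Cannot close $filename: $!";

# Returns a closure that parses and yields one hit per call,
# and returns nothing once the file is exhausted.
sub make_hit_iterator {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    return sub {
        defined( my $line = <$in> ) or return;
        chomp $line;
        my ( $id, $pos ) = split /\t/, $line;
        return { id => $id, position => $pos };
    };
}

my $next_hit = make_hit_iterator($filename);
while ( my $hit = $next_hit->() ) {
    print "$hit->{id} at $hit->{position}\n";
}
```

Because each call reads a single line, memory use stays constant no matter how many hits the file contains.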

QUICK START

This is only a short overview of the functionality of this module. You should also read Bio::Grep::Backends::BackendI and the documentation of the back-end you want to use (e.g. Bio::Grep::Backends::Vmatch).

GENERATE DATABASES

As a first step, you have to generate a Bio::Grep database from the Fasta file in which you want to search. A Bio::Grep database consists of a couple of files. It allows you to retrieve information about the database and to perform queries as quickly and as memory-efficiently as possible. You have to do this only once for every Fasta file.

For example:

my $sbe = Bio::Grep->new('Vmatch')->backend;	
$sbe->settings->datapath('data');
$sbe->generate_database_out_of_fastafile('../t/Test.fasta', 'Description for the test Fastafile');

Now, in a second script:

my $sbe = Bio::Grep->new('Vmatch')->backend;	
$sbe->settings->datapath('data');

my %local_dbs_description = $sbe->get_databases();
my @local_dbs = sort keys %local_dbs_description;

Alternatively, you can use bgrep which is part of this distribution:

bgrep --backend Vmatch --database TAIR6_cdna_20060907 --datapath data --createdb

SEARCH SETTINGS

All search settings are stored in the Bio::Grep::Container::SearchSettings object of the back-end:

$sbe->settings

To set an option, call

$sbe->settings->optionname(value)

For example

$sbe->settings->datapath('data');
# take first available database 
$sbe->settings->database($local_dbs[0]);

See the documentation of your back-end for available options.

To start the back-end with the specified settings, simply call

$sbe->search();

This method also accepts a hash reference with settings. In this case, all previously defined options except the paths and the database are reset to their default values.

$sbe->search({ mismatches => 2, 
               reverse_complement => 0, 
               query => $query });
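The reset behaviour described above can be modelled with plain hashes. This is a sketch of the semantics only, not the module's implementation: the default values, the set of persistent keys and the helper name effective_settings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical defaults and persistent keys, chosen only for this illustration.
my %DEFAULT    = ( mismatches => 0, reverse_complement => 0, query => undef );
my %PERSISTENT = map { $_ => 1 } qw(datapath database);

# Model of search(\%new): keep paths and database, reset everything else
# to its default, then apply the newly supplied settings.
sub effective_settings {
    my ( $current, $new ) = @_;
    my %kept = map  { $_ => $current->{$_} }
               grep { $PERSISTENT{$_} } keys %{$current};
    return { %DEFAULT, %kept, %{$new} };
}

my $before = { datapath => 'data', database => 'Test.fasta', mismatches => 2 };
my $after  = effective_settings( $before, { query => 'UGAACAGAAAG' } );

# mismatches falls back to the assumed default; datapath and database survive.
printf "%s %s %d\n", @{$after}{qw(datapath database mismatches)};
```

The practical consequence is that an option set through $sbe->settings before a call with a hash reference does not carry over unless it is repeated in that hash.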

ANALYZE SEARCH RESULTS

Use a BioPerl-style while loop to analyze the search results.

while ( my $res = $sbe->next_res ) {
   print $res->sequence->id . "\n";
   print $res->alignment_string() . "\n\n";
}

See Bio::Grep::Container::SearchResult for all available information.

BGREP

This distribution comes with a sample script called bgrep. See bgrep for details.

WHICH BACKEND?

We support these back-ends: Agrep, GUUGle, HyPa and Vmatch.

FEATURE COMPARISON

Feature                      Agrep  GUUGle  HyPa  Vmatch
Persistent Index [1]         no     no      yes   yes
Mismatches                   yes    no      yes   yes
Edit Distance                yes    no      no    yes
Insertions                   no     no      yes   no
Deletions                    no     no      yes   no
Multiple Queries [2]         no     yes     no    yes
GU [3]                       no     yes     yes   no
DNA/RNA                      yes    yes     yes   yes
Protein                      yes    no      yes   yes
Reverse Complement           yes    yes     yes   yes
Upstream/Downstream Regions  no     yes     yes   yes
Filters                      no     yes     yes   yes
Query Length [4]             no     yes     no    yes

[1] Needs precalculation and (much) more memory, but queries are in general faster
[2] With query_file
[3] HyPa also allows GU to count as only 0.5 mismatches
[4] Matches if a substring of the query of size n or larger matches

Vmatch is fast but needs a lot of memory. Agrep is the best choice if you allow many mismatches in short sequences, if you want to search in Fasta files with relatively short sequences (e.g. transcript databases), and if you are only interested in which sequences contain an approximate match. Its performance in this case is excellent. If you want the exact positions of a match in the sequence, choose Vmatch. If you want nice alignments, choose Vmatch too (EMBOSS can automatically align the sequence and the query in the Agrep back-end, but then Vmatch is faster). Filters require exact positions, so you can't use them with Agrep. This may change in a future version.

GUUGle may be the best choice if you have RNA queries (it counts GU as no mismatch) and if you are interested only in exact matches. Another solution here would be to use Vmatch and write a filter (see next section) that only allows GU mismatches. Of course, this is only an alternative if you can limit the maximum number of GU mismatches with $sbe->settings->mismatches(). Vmatch with its precalculated suffix arrays is really fast, so you should consider this option.
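The core test such a GU-only filter would need can be sketched in plain Perl. The helper below is hypothetical and not part of Bio::Grep. It assumes both sequences are RNA strings of equal length written 5' to 3', so the query is paired against the reversed target site. How the target site is obtained from a search result is left open here.

```perl
use strict;
use warnings;

# Watson-Crick pairs and the G-U wobble pair, keyed as "query base . target base".
my %WATSON_CRICK = map { $_ => 1 } qw(AU UA CG GC);
my %WOBBLE       = map { $_ => 1 } qw(GU UG);

# Returns the number of G-U wobble pairs if the query can bind the target
# site using only Watson-Crick and wobble pairs; returns undef otherwise.
sub count_gu_pairs {
    my ( $query, $target ) = @_;
    return if length $query != length $target;

    my @q = split //, uc $query;
    my @t = reverse split //, uc $target;    # strands pair antiparallel

    my $gu = 0;
    for my $i ( 0 .. $#q ) {
        my $pair = $q[$i] . $t[$i];
        next if $WATSON_CRICK{$pair};
        return unless $WOBBLE{$pair};        # any other mismatch disqualifies
        $gu++;
    }
    return $gu;
}

for my $site (qw(CUUUCUGUUCA UUUUCUGUUCA AUUUCUGUUCA)) {
    my $gu = count_gu_pairs( 'UGAACAGAAAG', $site );
    print "$site: ", ( defined $gu ? "$gu GU pair(s)" : 'rejected' ), "\n";
}
```

A filter's filter() method could return 1 when the result is defined and 0 otherwise.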

FILTERS

Filtering search results is a common task. For that, Bio::Grep provides a filter interface, Bio::Grep::Filter::FilterI. Writing filters is straightforward:

package MyFilter;

use strict;
use warnings;

use Bio::Grep::Filter::FilterI;

use base 'Bio::Grep::Filter::FilterI';

use Class::MethodMaker
 [ new => [ qw / new2 / ],
   ... # here some local variables, see perldoc Class::MethodMaker
 ];

sub new {
   my $self = shift->new2;
   $self->delete(1); # a filter that actually filters, not only adds
                     # remarks to $self->search_result->remark

   $self->supports_alphabet( dna => 1, protein => 1);
   $self;
}

sub filter {
   my $self = shift;
   # code that examines $self->search_result
   # and returns 0 (not passed) or 1 (passed)
   ...
   $self->message('passed');
   return 1;
}   

sub reset {
   my $self = shift;
   # if you need local variables, you can clean up here
}

1;# Magic true value required at end of module

To apply your filter:

...

my $filter = MyFilter->new();

$sbe->settings->filters( ( $filter ) );
$sbe->search();

See Bio::Grep::Filter::FilterI.

SECURITY

The use of Bio::Grep (in web services, for example) should be quite secure. All tests run in taint mode. We check the settings before we generate the string for the system() call. We use File::Temp for all temporary files.
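The idea of checking settings before they reach an external program can be sketched as follows. These checks are hypothetical and do not reproduce Bio::Grep's actual validation. The function name validated_args, the accepted alphabet and the placeholder program name are assumptions for this example.

```perl
use strict;
use warnings;

# Hypothetical checks; Bio::Grep's real validation may differ.
# Capturing with a regular expression also untaints the value in taint mode.
sub validated_args {
    my (%opt) = @_;

    $opt{query} =~ /\A([ACGTUN]+)\z/i
        or die "Query contains invalid characters\n";
    my $query = $1;

    $opt{mismatches} =~ /\A(\d+)\z/
        or die "Mismatches must be a non-negative integer\n";
    my $mismatches = $1;

    return ( $mismatches, $query );
}

my @args = validated_args( query => 'UGAACAGAAAG', mismatches => 2 );

# The list form of system() bypasses the shell, so metacharacters are never
# interpreted. 'some-backend' is a placeholder, so the call stays commented out:
# system( 'some-backend', @args ) == 0 or die "Back-end failed: $?";

my $ok = eval { validated_args( query => 'AC; rm -rf /', mismatches => 2 ); 1 };
print $ok ? "accepted\n" : "rejected: $@";
```

Whitelisting the allowed characters, rather than blacklisting dangerous ones, keeps the check short and easy to audit.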

INCOMPATIBILITIES

None reported.

BUGS AND LIMITATIONS

No bugs have been reported.

There is not yet a nice interface for searching with multiple queries. However, Vmatch and GUUGle support this feature, so you can generate a Fasta query file with Bio::SeqIO and then set $sbe->settings->query_file(). To find out which query a match belongs to, check $res->query.
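A query file of this kind can be produced in a few lines. The text suggests Bio::SeqIO. The sketch below instead writes the Fasta records directly with core Perl so that it has no dependencies. The query names, the second sequence and the temporary directory are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical query names and sequences.
my %queries = ( oligo1 => 'UGAACAGAAAG', oligo2 => 'GGAUCCAUGCA' );

my $dir        = tempdir( CLEANUP => 1 );
my $query_file = File::Spec->catfile( $dir, 'Oligos.fasta' );

# One Fasta record per query: a ">id" header line followed by the sequence.
open my $out, '>', $query_file or die "Cannot write $query_file: $!";
for my $id ( sort keys %queries ) {
    print {$out} ">$id\n$queries{$id}\n";
}
close $out or die "Cannot close $query_file: $!";

# With a back-end object $sbe as in the SYNOPSIS, the file would then be used as:
#   $sbe->settings->query_file($query_file);
#   $sbe->search();
```

Giving each record a distinct ID matters here, because the ID is presumably what lets you tell the queries apart when inspecting $res->query.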

It is likely that $sbe->settings->query will be renamed to queries().

Please report any bugs or feature requests to bug-bio-grep@rt.cpan.org, or through the web interface at http://rt.cpan.org.

SEE ALSO

Bio::Grep::Backends::BackendI Bio::Grep::Backends::Vmatch Bio::Grep::Backends::Agrep Bio::Grep::Backends::Hypa Bio::Grep::Backends::GUUGle

PUBLICATIONS

GUUGle: http://bioinformatics.oxfordjournals.org/cgi/content/full/22/6/762

HyPa: http://nar.oxfordjournals.org/cgi/content/full/29/1/196

AUTHOR

Markus Riester, <mriester@gmx.de>

LICENCE AND COPYRIGHT

Based on Weigel::Search v0.13

Copyright (C) 2005-2006 by Max Planck Institute for Developmental Biology, Tuebingen.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.