NAME

Bio::MUST::Core::Ali::Stash - Thin wrapper for an indexed Ali read from disk

VERSION

version 0.210230

SYNOPSIS

#!/usr/bin/env perl

use Modern::Perl '2011';
# same as:
# use strict;
# use warnings;
# use feature qw(say);

use Bio::MUST::Core;
use aliased 'Bio::MUST::Core::Ali::Stash';
use aliased 'Bio::MUST::Core::IdList';

# load database
my $db = Stash->load('database.fasta');

# process OrthoFinder-like output file
# where each line defines a cluster followed by its member sequences
# cluster1: seq3 seq7 seq2
# cluster2: seq1 seq4 seq6 seq5
# ...

open my $in, '<', 'clusters.txt'
    or die "Cannot open clusters.txt: $!";
while (my $line = <$in>) {
    chomp $line;

    # extract member id list for current cluster
    my ($cluster, @ids) = split /\s+/xms, $line;
    $cluster =~ s/:\z//xms;             # remove trailing colon (:)
    my $list = IdList->new( ids => \@ids );

    # assemble Ali and store it as FASTA file
    my $ali = $list->reordered_ali($db);
       $ali->dont_guess;
    $ali->store( $cluster . '.fasta' );
}

DESCRIPTION

This module implements a class representing a sequence database where ids are indexed for faster access. To this end, it combines an internal Bio::MUST::Core::Ali object and a Bio::MUST::Core::IdList object.

An Ali::Stash is meant to be built from an existing ALI (or FASTA) file residing on disk and cannot be altered once loaded. Its sequences are expected to be unaligned, but aligned FASTA files are also processed correctly. By default, the full-length sequence ids are indexed. If the first word of each id (the leading run of non-whitespace characters, i.e. the accession) is unique across the database, it can be indexed instead via the option truncate_ids => 1 of the load method (see load below for an example).
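Since truncate_ids relies on first words being unique, it can be worth checking a database before loading it that way. The following is a minimal sketch in plain Perl, not part of Bio::MUST::Core: the helper duplicated_accessions and the sample records are hypothetical, and its notion of a first word (everything after > up to the first whitespace) is an assumption that may differ in detail from the module's own parsing.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Bio::MUST::Core function): return the first
# words that occur in more than one FASTA header read from a filehandle.
sub duplicated_accessions {
    my ($fh) = @_;

    my %count_for;
    while (my $line = <$fh>) {
        # only header lines matter; capture the first word after '>'
        next unless $line =~ m/\A > \s* (\S+)/xms;
        $count_for{$1}++;
    }

    my @dups = sort grep { $count_for{$_} > 1 } keys %count_for;
    return @dups;
}

# made-up records held in memory so that the sketch is self-contained
my $fasta = <<'EOT';
>acc1 first protein
MKV
>acc2 second protein
MLA
>acc1 another description
MKT
EOT

open my $in, '<', \$fasta
    or die "Cannot open in-memory file: $!";
my @dups = duplicated_accessions($in);
close $in;

# prints "Not unique: acc1"
print @dups ? "Not unique: @dups\n" : "First words are unique\n";
```

If the list comes back empty, loading with truncate_ids => 1 should be safe under the assumption stated above; otherwise the full-length ids are the better choice.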

While this class is more efficient than the standard Ali, it is much slower at reading large sequence databases than specialized external programs, such as NCBI blastdbcmd, that work on indexed binary files. Thus, if you need more performance, have a look at the Blast::Database class from the Bio::MUST::Drivers distribution.

ATTRIBUTES

seqs

Bio::MUST::Core::Ali object (required)

This required attribute contains the Bio::MUST::Core::Seq objects that populate the associated sequence database file. It should be initialized through the class method load (see the SYNOPSIS for an example).

For now, it provides the following methods: count_comments, all_comments, get_comment, guessing, all_seq_ids, has_uniq_ids, is_protein, is_aligned, get_seq, get_seq_with_id (see below), first_seq, all_seqs, filter_seqs and count_seqs (see Bio::MUST::Core::Ali).

lookup

Bio::MUST::Core::IdList object (auto)

This attribute is automatically initialized with the list indexing the sequence ids of the internal Ali object. Thus, it cannot be user-specified.

It provides the following method: index_for (see Bio::MUST::Core::IdList). However, this method should be treated as nearly private. Individual sequences should instead be accessed through the get_seq_with_id method (see below), while sequence batches should be recovered via user-specified IdList objects (see the SYNOPSIS for an example).

ACCESSORS

get_seq_with_id

Returns a sequence of the Ali::Stash by its id. Note that sequence ids are assumed to be unique in the corresponding database. If no sequence exists for the specified id, this method will return undef.

use Carp;       # provides croak

my $id = 'Pyrus malus_3750@658052655';
my $seq = $db->get_seq_with_id($id);
croak "Seq $id not found in Ali::Stash!" unless defined $seq;

This method accepts just one argument (and not an array slice).

It is a faster implementation of the same method from the Ali class.
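How the speed-up is obtained is not stated here; presumably it comes from the lookup index. As a rough illustration only, the sketch below contrasts a linear scan with an indexed lookup over plain Perl data. The names scan_for_id and get_by_id, as well as the records, are hypothetical and do not reflect the actual implementation of either class.

```perl
use strict;
use warnings;
use List::Util qw(first);

# made-up records standing in for sequence objects
my @records;
push @records, { id => "seq$_", seq => 'ACGT' x $_ } for 1 .. 5;

# index built once: id => position in @records
my %index_for;
$index_for{ $records[$_]{id} } = $_ for 0 .. $#records;

# linear scan: inspects records one by one until the id matches
sub scan_for_id {
    my ($id) = @_;
    return first { $_->{id} eq $id } @records;
}

# indexed lookup: one hash access followed by one array access
sub get_by_id {
    my ($id) = @_;
    my $i = $index_for{$id};
    return defined $i ? $records[$i] : undef;
}

my $hit  = get_by_id('seq3');
my $miss = get_by_id('seq9');

print $hit->{seq}, "\n";                            # prints "ACGTACGTACGT"
print defined $miss ? "found\n" : "not found\n";    # prints "not found"
```

Both helpers return undef for an unknown id, mirroring the behaviour described above. The cost of the scan grows with the number of records, whereas the indexed lookup stays roughly constant once the index is built.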

I/O METHODS

load

Class method (constructor) returning a new Ali::Stash read from disk. As in Ali, this method will transparently import plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).

# load database
my $db = Stash->load( 'database.fasta' );

# alternatively... (indexing only accessions)
my $db = Stash->load( 'database.fasta', { truncate_ids => 1 } );

This method requires one argument (the path to the file) and accepts an optional second argument controlling the way sequence ids are processed. The second argument is a hash reference that may contain only the following key:

- truncate_ids: consider only the first id word (accession)

AUTHOR

Denis BAURAIN <denis.baurain@uliege.be>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.