lib/Bio/MUST/Core/Ali/Stash.pm


            
              1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
—
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
              package Bio::MUST::Core::Ali::Stash;
# ABSTRACT: Thin wrapper for an indexed Ali read from disk
$Bio::MUST::Core::Ali::Stash::VERSION = '0.250380';
use Moose;
use namespace::autoclean;
use autodie;
use feature qw(say);
use Smart::Comments;
use Carp;
use Bio::MUST::Core::Types;
use Bio::MUST::Core::Constants qw(:seqids);
use aliased 'Bio::MUST::Core::Ali';
# ATTRIBUTES
has 'seqs' => (
    is       => 'ro',
    isa      => 'Bio::MUST::Core::Ali',
    required => 1,
    handles  => [
        qw(filename count_comments all_comments get_comment
            guessing all_seq_ids has_uniq_ids is_protein is_aligned
            get_seq first_seq all_seqs filter_seqs count_seqs
            gapmiss_regex
        )
    ],      # comment-related methods needed by IdList
);
has 'lookup' => (
    is       => 'ro',
    isa      => 'Bio::MUST::Core::IdList',
    init_arg => undef,
    lazy     => 1,
    builder  => '_build_lookup',
    handles  => [ qw(index_for) ],
);
with 'Bio::MUST::Core::Roles::Aliable';
## no critic (ProhibitUnusedPrivateSubroutines)
sub _build_lookup {
    return shift->seqs->new_lookup;
}
## use critic
# ACCESSORS
sub get_seq_with_id {
    my $self = shift;
    my $id   = shift;
    # override Ali method with faster lookup-based alternative
    my $index = $self->index_for($id);
    return $self->get_seq($index)
        if defined $index;
    carp "[BMC] Warning: cannot find seq with id: $id; returning undef!";
    return;
}
# I/O methods
sub load {
    my $class  = shift;
    my $infile = shift;
    my $args   = shift // {};           # HashRef (should not be empty...)
    my $seqs = Ali->load($infile);
       $seqs->dont_guess;
    if ( $args->{truncate_ids} ) {
        my $mapper = $seqs->regex_mapper( q{}, $DEF_ID );
        $seqs->shorten_ids($mapper);
    }
    return $class->new(seqs => $seqs);
}
__PACKAGE__->meta->make_immutable;
1;
__END__
=pod
=head1 NAME
Bio::MUST::Core::Ali::Stash - Thin wrapper for an indexed Ali read from disk
=head1 VERSION
version 0.250380
=head1 SYNOPSIS
    #!/usr/bin/env perl
    use Modern::Perl '2011';
    # same as:
    # use strict;
    # use warnings;
    # use feature qw(say);
    use Bio::MUST::Core;
    use aliased 'Bio::MUST::Core::Ali::Stash';
    use aliased 'Bio::MUST::Core::IdList';
    # load database
    my $db = Stash->load('database.fasta');
    # process OrthoFinder-like output file
    # where each line defines a cluster followed by its member sequences
    # cluster1: seq3 seq7 seq2
    # cluster2: seq1 seq4 seq6 seq5
    # ...
    open my $in, '<', 'clusters.txt';
    while (my $line = <$in>) {
        chomp $line;
        # extract member id list for current cluster
        my ($cluster, @ids) = split /\s+/xms, $line;
        $cluster =~ s/:\z//xms;             # remove trailing colon (:)
        my $list = IdList->new( ids => \@ids );
        # assemble Ali and store it as FASTA file
        my $ali = $list->reordered_ali($db);
           $ali->dont_guess;
        $ali->store( $cluster . '.fasta' );
    }
=head1 DESCRIPTION
This module implements a class representing a sequence database where ids are
indexed for faster access. To this end, it combines an internal
L<Bio::MUST::Core::Ali> object and a L<Bio::MUST::Core::IdList> object.
An Ali::Stash is meant to be built from an existing ALI (or FASTA) file
residing on disk and cannot be altered once loaded. Its sequences are supposed
not to be aligned but aligned FASTA files are also processed correctly. By
default, the full-length sequence ids are indexed. If the first word of each
id (non-whitespace containing string or accession) is unique across the
database, it can be used instead via the option C<<truncate_ids => 1>> of the
C<load> method (see the SYNOPSIS for an example).
While this class is more efficient than the standard C<Ali>, it is way slower
at reading large sequence databases than specialized external programs such as
NCBI C<blastdbcmd> working on indexed binary files. Thus, if you need more
performance, have a look at the C<Blast::Database> class from the
L<Bio::MUST::Drivers> distribution.
=head1 ATTRIBUTES
=head2 seqs
L<Bio::MUST::Core::Ali> object (required)
This required attribute contains the L<Bio::MUST::Core::Seq> objects that
populate the associated sequence database file. It should be initialized
through the class method C<load> (see the SYNOPSIS for an example).
For now, it provides the following methods: C<count_comments>,
C<all_comments>, C<get_comment>, C<guessing>, C<all_seq_ids>, C<has_uniq_ids>,
C<is_protein>, C<is_aligned>, C<get_seq>, C<get_seq_with_id> (see below),
C<first_seq>, C<all_seqs>, C<filter_seqs> and C<count_seqs> (see
L<Bio::MUST::Core::Ali>).
=head2 lookup
L<Bio::MUST::Core::IdList> object (auto)
This attribute is automatically initialized with the list indexing the
sequence ids of the internal C<Ali> object. Thus, it cannot be user-specified.
It provides the following method: C<index_for> (see
L<Bio::MUST::Core::IdList>). Yet, it is nearly a private method. Instead,
individual sequences should be accessed through the C<get_seq_with_id> method
(see below), while sequence batches should be recovered via user-specified
IdList objects (see the SYNOPSIS for an example).
=head1 ACCESSORS
=head2 get_seq_with_id
Returns a sequence of the Ali::Stash by its id. Note that sequence ids are
assumed to be unique in the corresponding database. If no sequence exists for
the specified id, this method will return C<undef>.
    my $id = 'Pyrus malus_3750@658052655';
    my $seq = $db->get_seq_with_id($id);
    croak "Seq $id not found in Ali::Stash!" unless defined $seq;
This method accepts just one argument (and not an array slice).
It is a faster implementation of the same method from the C<Ali> class.
=head1 I/O METHODS
=head2 load
Class method (constructor) returning a new Ali::Stash read from disk. As in
C<Ali>, this method will transparently import plain FASTA files in addition to
the MUST pseudo-FASTA format (ALI files).
    # load database
    my $db = Stash->load( 'database.fasta' );
    # alternatively... (indexing only accessions)
    my $db = Stash->load( 'database.fasta', { truncate_ids => 1 } );
This method requires one argument and accepts a second optional argument
controlling the way sequence ids are processed. It is a hash reference that
may only contain the following key:
    - truncate_ids: consider only the first id word (accession)
=head1 AUTHOR
Denis BAURAIN <denis.baurain@uliege.be>
=head1 COPYRIGHT AND LICENSE
This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.
=cut
	Global
`s`	Focus search bar
`?`	Bring up this help dialog
	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)
	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse
	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)