NAME

BioUtil::Seq - Utilities for sequence

Great modules such as BioPerl provide many robust solutions. However, BioPerl is not easy to install on some platforms, and for simple scripting tasks a lightweight module may be a better choice. So I reinvented some wheels and added some useful utilities to this module, hoping they will be helpful.

VERSION

Version 2015.0309

EXPORT

FastaReader
read_sequence_from_fasta_file 
write_sequence_to_fasta_file 
format_seq

validate_sequence 
complement
revcom 
base_content 
degenerate_seq_to_regexp
match_regexp
dna2peptide 
codon2aa 
generate_random_seqence

shuffle_sequences 
rename_fasta_header 
clean_fasta_header 

SYNOPSIS

use BioUtil::Seq;

SUBROUTINES/METHODS

FastaReader

FastaReader is a fasta file parser built on a closure. It returns an anonymous subroutine; each call to that subroutine returns the next fasta record as a reference to an array containing the fasta header and the sequence.

FastaReader can also read from STDIN when the file name is "STDIN" or "stdin".

A boolean argument is optional. If it is true, whitespace in the sequence, including blanks, tabs, carriage returns ("\r") and newlines ("\n"), will not be trimmed.

FastaReader gains its speed by setting the special Perl variable $/ (the input record separator) to "\n>", with the kind help of Mario Roy, author of MCE (https://code.google.com/p/many-core-engine-perl/), who also contributed many other optimizations.

Example:

# do not trim the spaces and \n
# $not_trim = 1;
# my $next_seq = FastaReader("test.fa", $not_trim);

# read from STDIN
# my $next_seq = FastaReader('STDIN');

# read from file
my $next_seq = FastaReader("test.fa");

while ( my $fa = $next_seq->() ) {
    my ( $header, $seq ) = @$fa;

    print ">$header\n$seq\n";
}
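
The closure and $/ technique described above can be sketched roughly as follows. This is only an illustration, not the code of FastaReader: the name make_fasta_reader is hypothetical, it reads from an already opened filehandle, and it always trims whitespace from the sequence.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioUtil::Seq): returns a closure that
# yields one [ header, sequence ] record per call and undef at end of input.
sub make_fasta_reader {
    my ($fh) = @_;
    return sub {
        local $/ = "\n>";    # one readline call returns one whole record
        while ( defined( my $rec = <$fh> ) ) {
            $rec =~ s/^>//;        # leading ">" of the first record
            $rec =~ s/\n>\z//;     # trailing record separator
            my ( $header, $seq ) = split /\n/, $rec, 2;
            next unless defined $header && length $header;
            $seq = '' unless defined $seq;
            $header =~ s/\r\z//;
            $seq    =~ s/\s+//g;    # remove all whitespace in the sequence
            return [ $header, $seq ];
        }
        return;
    };
}

# Read two records from an in-memory "file".
my $fasta = ">seq1 test\nACGT\nACGT\n>seq2\nTTGG\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
my $next_rec = make_fasta_reader($fh);
my @records;
while ( my $fa = $next_rec->() ) {
    push @records, $fa;
}
close $fh;
```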

read_sequence_from_fasta_file

Read all sequences from a fasta file. The result is a hash reference that maps each header to its sequence.

Example:

my $seqs = read_sequence_from_fasta_file($file);
for my $header (keys %$seqs) {
    my $seq = $seqs->{$header};
    print ">$header\n$seq\n";
}

write_sequence_to_fasta_file

Write sequences, given as a hash reference that maps headers to sequences, to a fasta file.

Example:

my $seq = {"seq1" => "acgagaggag"};
write_sequence_to_fasta_file($seq, "seq.fa");

format_seq

Format a sequence as readable text by breaking it into lines of a given length.

Example:

printf ">%s\n%s", $head, format_seq($seq, 60);
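
Line wrapping of this kind can be sketched as below. The helper name wrap_seq is hypothetical, and the default width of 60 and the trailing newline after the last line are assumptions rather than documented behaviour of format_seq.

```perl
use strict;
use warnings;

# Hypothetical helper: break $seq into lines of at most $width characters,
# each terminated by a newline.
sub wrap_seq {
    my ( $seq, $width ) = @_;
    $width = 60 unless defined $width && $width > 0;    # assumed default
    my $out = '';
    for ( my $i = 0 ; $i < length($seq) ; $i += $width ) {
        $out .= substr( $seq, $i, $width ) . "\n";
    }
    return $out;
}

my $wrapped = wrap_seq( 'ACGTACGTAC', 4 );
print $wrapped;
```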

validate_sequence

Validate a sequence.

Legal symbols:

DNA: ACGTRYSWKMBDHV
RNA: ACGURYSWKMBDHV
Protein: ACDEFGHIKLMNPQRSTVWY
gap and space: - *.

Example:

if (validate_sequence($seq)) {
    # do something
}
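
One way to check a sequence against the symbols listed above is a single character class. This is a sketch under assumptions: the name is_valid_seq is hypothetical, the three alphabets are merged into one set, and matching is case-insensitive; validate_sequence itself may behave differently.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $seq contains only symbols from the union of
# the DNA, RNA and protein alphabets listed above, plus "-", "*", "." and space.
sub is_valid_seq {
    my ($seq) = @_;
    return $seq =~ /\A[ACDEFGHIKLMNPQRSTVWYBU*. -]*\z/i ? 1 : 0;
}

for my $seq ( 'ACGTN-acgu', 'ACGT1' ) {
    printf "%s: %s\n", $seq, is_valid_seq($seq) ? 'valid' : 'invalid';
}
```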

complement

Return the complement of a sequence.

IUPAC nucleotide code: ACGTURYSWKMBDHVN

http://droog.gs.washington.edu/parc/images/iupac.html

code    base        complement
A       A           T
C       C           G
G       G           C
T/U     T           A

R       A/G         Y
Y       C/T         R
S       C/G         S
W       A/T         W
K       G/T         M
M       A/C         K

B       C/G/T       V
D       A/G/T       H
H       A/C/T       D
V       A/C/G       B

X/N     A/C/G/T     X
.       not A/C/G/T
. or -  gap

my $comp = complement($seq);

revcom

Return the reverse complement of a sequence.

my $revcom = revcom($seq);
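
The IUPAC complement table and the reverse complement can be sketched with Perl's tr/// operator. The names comp_sketch and revcom_sketch are hypothetical; this sketch keeps the case of the input and leaves S, W, N, X and gap symbols unchanged, which is an assumption about the desired behaviour rather than the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: complement a nucleotide sequence using the IUPAC table.
sub comp_sketch {
    my ($seq) = @_;
    ( my $c = $seq ) =~ tr/ACGTURYKMBDHVacgturykmbdhv/TGCAAYRMKVHDBtgcaayrmkvhdb/;
    return $c;
}

# Hypothetical helper: reverse the sequence, then complement it.
sub revcom_sketch {
    my ($seq) = @_;
    return comp_sketch( scalar reverse $seq );
}

print comp_sketch('ACGTRYKM'), "\n";
print revcom_sketch('AACG'),   "\n";
```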

base_content

Calculate the content of the given bases (for example 'gc') in a sequence.

Example:

my $gc_content = base_content('gc', $seq);
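
A base content calculation can be sketched as counting the matching characters and dividing by the sequence length. The name base_fraction is hypothetical, and returning a fraction between 0 and 1 with case-insensitive counting is an assumption; base_content may report its result differently.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of characters in $seq that belong to $bases.
sub base_fraction {
    my ( $bases, $seq ) = @_;
    return 0 unless length $seq;
    my $class = quotemeta $bases;              # make the bases safe in a class
    my $count = () = $seq =~ /[$class]/gi;     # count all matches
    return $count / length($seq);
}

printf "GC content: %.2f\n", base_fraction( 'gc', 'ACGTGGCC' );
```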

degenerate_seq_to_regexp

Translate a degenerate sequence (one containing IUPAC ambiguity codes) into a regular expression.
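
Such a translation can be sketched by mapping every IUPAC code to a character class. The name degenerate_to_regex and the exact output format are assumptions for illustration; the regular expression produced by degenerate_seq_to_regexp may look different.

```perl
use strict;
use warnings;

# IUPAC nucleotide codes and the bases they stand for.
my %IUPAC = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',     U => 'U',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Hypothetical helper: turn a degenerate sequence into a regexp string.
# Unknown symbols are escaped and matched literally.
sub degenerate_to_regex {
    my ($seq) = @_;
    return join '',
      map { exists $IUPAC{$_} ? $IUPAC{$_} : quotemeta $_ } split //, uc $seq;
}

my $re = degenerate_to_regex('GAYTACA');
print "$re\n";
print "match\n" if 'GATTACA' =~ /$re/;
```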

match_regexp

Find all sites matching the regular expression.

See https://github.com/shenwei356/bio_scripts/blob/master/sequence/fasta_locate_motif.pl
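
Finding all matching sites, including overlapping ones, can be sketched as below. The name find_sites and the returned structure (1-based start, end and matched text) are assumptions made for this illustration; see the linked script for real usage of match_regexp. The sketch also assumes the pattern never matches an empty string.

```perl
use strict;
use warnings;

# Hypothetical helper: return a reference to a list of [ start, end, match ]
# for every site in $seq that matches $regex (positions are 1-based).
sub find_sites {
    my ( $regex, $seq ) = @_;
    my @sites;
    while ( $seq =~ /($regex)/gi ) {
        my ( $from, $to ) = ( $-[1], $+[1] );
        push @sites, [ $from + 1, $to, $1 ];
        pos($seq) = $from + 1;    # step one base forward to allow overlaps
    }
    return \@sites;
}

my $sites = find_sites( 'AA', 'AAAA' );
print join( "\t", @$_ ), "\n" for @$sites;
```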

dna2peptide

Translate a DNA sequence into a peptide.

codon2aa

Translate a DNA 3-character codon to an amino acid
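
Codon and sequence translation can be sketched with a lookup table. This sketch assumes the standard genetic code, translates from the first base, ignores an incomplete trailing codon and returns 'X' for unknown codons; the names codon_to_aa and translate_dna are hypothetical, and codon2aa and dna2peptide may handle these cases differently.

```perl
use strict;
use warnings;

# Build the standard genetic code from its usual compact layout
# (bases in the order T, C, A, G for each codon position).
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $idx = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr( $aas, $idx++, 1 );
        }
    }
}

# Hypothetical helper: translate one codon, 'X' if it is not in the table.
sub codon_to_aa {
    my ($codon) = @_;
    $codon = uc $codon;
    $codon =~ tr/U/T/;    # accept RNA codons as well
    return exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
}

# Hypothetical helper: translate a DNA sequence codon by codon.
sub translate_dna {
    my ($dna) = @_;
    my $pep = '';
    for ( my $p = 0 ; $p + 3 <= length($dna) ; $p += 3 ) {
        $pep .= codon_to_aa( substr( $dna, $p, 3 ) );
    }
    return $pep;
}

print translate_dna('ATGGCCTAA'), "\n";
```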

generate_random_seqence

Generate a random sequence of the given length from the given alphabet.

Example:

my @alphabet = qw/a c g t/;
my $seq = generate_random_seqence( \@alphabet, 50 );
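
Drawing a random sequence from an alphabet can be sketched with Perl's built-in rand. The name random_seq is hypothetical and the sketch picks each symbol uniformly and independently, which is an assumption about how generate_random_seqence works.

```perl
use strict;
use warnings;

# Hypothetical helper: pick $length symbols uniformly at random from the
# alphabet given as an array reference.
sub random_seq {
    my ( $alphabet, $length ) = @_;
    my $n = @$alphabet;
    return join '', map { $alphabet->[ int( rand($n) ) ] } 1 .. $length;
}

my @dna    = qw/a c g t/;
my $random = random_seq( \@dna, 50 );
print "$random\n";
```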

shuffle_sequences

Shuffle the sequences in a fasta file and write them to a new file.

Example:

shuffle_sequences($file, "$file.shuf.fa");
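
The shuffling step itself can be sketched with shuffle from the core module List::Util. This sketch works on records held in memory rather than on files, and the record layout is an assumption made for illustration.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Records as [ header, sequence ] pairs, as FastaReader would return them.
my @input = ( [ 'seq1', 'ACGT' ], [ 'seq2', 'GGCC' ], [ 'seq3', 'TTAA' ] );

# Randomly reorder the records; the sequences themselves are not changed.
my @shuffled = shuffle @input;

print ">$_->[0]\n$_->[1]\n" for @shuffled;
```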

rename_fasta_header

Rename fasta headers with a regular expression.

Example:

# delete some symbols
my $n = rename_fasta_header('[^a-z\d\s\-\_\(\)\[\]\|]', '', $file, "$file.rename.fa");
print "$n records renamed\n";
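
The renaming step can be sketched on a list of headers as follows. The name rename_headers is hypothetical, and applying the substitution globally and case-insensitively is an assumption; rename_fasta_header works on files and may apply the regular expression differently.

```perl
use strict;
use warnings;

# Hypothetical helper: apply s/$regex/$replacement/ to every header and
# return the number of changed headers plus the new headers.
sub rename_headers {
    my ( $regex, $replacement, $headers ) = @_;
    my $changed = 0;
    my @renamed;
    for my $header (@$headers) {
        ( my $new = $header ) =~ s/$regex/$replacement/gi;
        $changed++ if $new ne $header;
        push @renamed, $new;
    }
    return ( $changed, \@renamed );
}

my ( $count, $renamed ) = rename_headers(
    '[^a-z\d\s\-\_\(\)\[\]\|]', '',
    [ 'seq1 E.coli', 'seq2', 'gene#3 (test)' ]
);
print "$count records renamed\n";
```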

clean_fasta_header

Replace the given symbols in fasta headers with a replacement string, because some symbols in a fasta header can cause unexpected results.

Example:

my $file = "test.fa";
my $n = clean_fasta_header($file, "$file.rename.fa");
# replace any symbol in (\/:*?"<>|) with '', i.e. deleting.
# my $n = clean_fasta_header($file, "$file.rename.fa", '',  '\/:*?"<>|');
print "$n records renamed\n";