NAME

BioUtil::Seq - Utilities for sequences

Mature modules such as BioPerl provide many robust solutions. However, BioPerl can be hard to install on some platforms, and for simple scripts a lightweight module may be a better choice. So I reinvented some wheels and added some useful utilities to this module, hoping it will be helpful.

VERSION

Version 2014.1213

EXPORT

FastaReader
read_sequence_from_fasta_file 
write_sequence_to_fasta_file 
format_seq

validate_sequence 
complement
revcom 
base_content 
degenerate_seq_to_regexp
degenerate_seq_match_sites
dna2peptide 
codon2aa 
generate_random_seqence

shuffle_sequences 
rename_fasta_header 
clean_fasta_header 

SYNOPSIS

use BioUtil::Seq;

SUBROUTINES/METHODS

FastaReader

FastaReader is a FASTA file parser implemented as a closure. FastaReader returns an anonymous subroutine; each call to it returns the next FASTA record, which is a reference to an array containing the header and the sequence.

FastaReader can also read from STDIN when the file name is "STDIN".

A second, boolean argument is optional. If it is set to a true value, carriage return ("\r") and newline ("\n") characters in the sequence are not trimmed.

Example:

# do not trim the spaces and \n
# $not_trim = 1;
# my $next_seq = FastaReader("test.fa", $not_trim);

# read from STDIN
# my $next_seq = FastaReader('STDIN');

# read from file
my $next_seq = FastaReader("test.fa");

while ( my $fa = $next_seq->() ) {
    my ( $header, $seq ) = @$fa;

    print ">$header\n$seq\n";
}
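
To illustrate the closure technique described above, here is a minimal sketch of an iterator over FASTA records. It is not the module's implementation: make_fasta_iterator is a hypothetical helper, it always trims line endings, and it reads from an in-memory filehandle so that no file on disk is needed.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq: returns a closure that
# yields one [header, sequence] record per call and undef at end of input.
sub make_fasta_iterator {
    my ($fh) = @_;
    my $pending_header;    # header of the record currently being collected
    my $done = 0;

    return sub {
        return if $done;
        my $seq = '';
        while ( my $line = <$fh> ) {
            $line =~ s/\r?\n\z//;    # strip the line ending
            if ( $line =~ /^>(.*)/ ) {
                my $header = $1;
                if ( defined $pending_header ) {
                    # a new record starts, so hand back the finished one
                    my $record = [ $pending_header, $seq ];
                    $pending_header = $header;
                    return $record;
                }
                $pending_header = $header;
            }
            elsif ( defined $pending_header ) {
                $seq .= $line;
            }
        }
        $done = 1;
        return defined $pending_header ? [ $pending_header, $seq ] : undef;
    };
}

# Sample data held in a string instead of a file.
my $fasta = ">seq1 demo\nACGT\nACGT\n>seq2\nTTGA\n";
open my $fh, '<', \$fasta or die "cannot open in-memory handle: $!";

my $next_record = make_fasta_iterator($fh);
while ( my $fa = $next_record->() ) {
    my ( $header, $seq ) = @$fa;
    print ">$header\n$seq\n";
}
```

The closure keeps the filehandle and the pending header between calls, so the caller holds only one record in memory at a time.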

read_sequence_from_fasta_file

Read all sequences from a FASTA file. The result is a reference to a hash that maps each header to its sequence.

Example:

my $seqs = read_sequence_from_fasta_file($file);
for my $header (keys %$seqs) {
    my $seq = $$seqs{$header};
    print ">$header\n$seq\n";
}

write_sequence_to_fasta_file

Write sequences, given as a reference to a hash of header and sequence pairs, to a FASTA file.

Example:

my $seq = {"seq1" => "acgagaggag"};
write_sequence_to_fasta_file($seq, "seq.fa");

format_seq

Format a sequence as readable text.

Example:

my $seq = {"seq1" => "acgagaggag"};
write_sequence_to_fasta_file($seq, "seq.fa");
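
As an illustration of what formatting a sequence into readable text can look like, the sketch below wraps a sequence to a fixed line width. wrap_sequence is a hypothetical helper, not format_seq itself, and the default width of 60 is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: break a sequence into lines of at most $width characters.
sub wrap_sequence {
    my ( $seq, $width ) = @_;
    $width = 60 unless defined $width;    # assumed default line width
    my @lines = $seq =~ /.{1,$width}/g;   # greedy chunks of up to $width
    return join( "\n", @lines ) . "\n";
}

print wrap_sequence( 'acgagaggagacgagaggag', 8 );
# prints:
# acgagagg
# agacgaga
# ggag
```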

validate_sequence

Validate a sequence.

Legal symbols:

DNA: ACGTRYSWKMBDHV
RNA: ACGURYSWKMBDHV
Protein: ACDEFGHIKLMNPQRSTVWY
gap and space: - *.

Example:

if (validate_sequence($seq)) {
    # do something
}
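
A check against the alphabets listed above can be sketched with a character class. This is only an illustration: looks_like is a hypothetical helper, its explicit type argument and its case-insensitive matching are assumptions, and validate_sequence itself may behave differently.

```perl
use strict;
use warnings;

# Alphabets copied from the list of legal symbols above.
my %ALPHABET = (
    dna     => 'ACGTRYSWKMBDHV',
    rna     => 'ACGURYSWKMBDHV',
    protein => 'ACDEFGHIKLMNPQRSTVWY',
);

# Hypothetical helper: returns 1 if $seq uses only symbols of the given type.
sub looks_like {
    my ( $seq, $type ) = @_;
    my $letters = $ALPHABET{$type} or die "unknown type: $type";

    # gap and space symbols (- space * .) are accepted for every type
    my $valid = qr/\A[${letters}\- *.]+\z/i;
    return $seq =~ $valid ? 1 : 0;
}

for my $case ( [ 'acgt-ryk', 'dna' ], [ 'ACGU', 'dna' ], [ 'ACGU', 'rna' ] ) {
    my ( $seq, $type ) = @$case;
    printf "%-10s as %-4s: %s\n", $seq, $type,
        looks_like( $seq, $type ) ? 'valid' : 'invalid';
}
```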

complement

Complement sequence

IUPAC nucleotide code: ACGTURYSWKMBDHVN

http://droog.gs.washington.edu/parc/images/iupac.html

Code    Base            Complement
A       A               T
C       C               G
G       G               C
T/U     T               A

R       A/G             Y
Y       C/T             R
S       C/G             S
W       A/T             W
K       G/T             M
M       A/C             K

B       C/G/T           V
D       A/G/T           H
H       A/C/T           D
V       A/C/G           B

X/N     A/C/G/T         X
. or -  gap (not A/C/G/T)

my $comp = complement($seq);

revcom

Reverse complement sequence

my $revcom = revcom($seq);
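
Both operations can be sketched with Perl's tr/// operator, following the IUPAC complement pairs. The helper names are hypothetical and this is not necessarily how the module does it. As an assumption, symbols outside the mapping (N, X, gaps) are left unchanged, and U is complemented to A.

```perl
use strict;
use warnings;

# Hypothetical helper: complement via transliteration, preserving case.
sub complement_sketch {
    my ($seq) = @_;
    ( my $comp = $seq ) =~
        tr/ACGTURYSWKMBDHVacgturyswkmbdhv/TGCAAYRSWMKVHDBtgcaayrswmkvhdb/;
    return $comp;
}

# Hypothetical helper: reverse complement is the complement read backwards.
sub revcom_sketch {
    my ($seq) = @_;
    return scalar reverse complement_sketch($seq);
}

print complement_sketch('ACGTRYKM'), "\n";    # prints TGCAYRMK
print revcom_sketch('AACG'),         "\n";    # prints CGTT
```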

base_content

Calculate the content of the given bases in a sequence.

Example:

my $gc_content = base_content('gc', $seq);
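
As a worked example of the calculation, the sketch below counts the requested bases and divides by the sequence length. base_fraction is a hypothetical helper; returning a fraction between 0 and 1 and ignoring case are assumptions, and base_content may report its result differently.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of $seq made up of any base in $bases.
sub base_fraction {
    my ( $bases, $seq ) = @_;
    return 0 if length($seq) == 0;

    # list-assignment idiom: count all case-insensitive matches
    my $count = () = $seq =~ /[\Q$bases\E]/gi;
    return $count / length($seq);
}

# 'acgagaggag' has one c and five g among ten bases.
printf "GC content: %.2f\n", base_fraction( 'gc', 'acgagaggag' );    # prints GC content: 0.60
```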

degenerate_seq_to_regexp

Translate degenerate sequence to regular expression
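
The translation can be sketched as a lookup from each IUPAC code to a character class. degenerate_to_pattern is a hypothetical helper that returns a pattern string; the return type of degenerate_seq_to_regexp itself is not specified here, and treating N as [ACGT] is an assumption.

```perl
use strict;
use warnings;

# IUPAC nucleotide codes and the bases each one stands for.
my %IUPAC_CLASS = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',     U => 'U',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Hypothetical helper: turn a degenerate sequence into a regex pattern string.
sub degenerate_to_pattern {
    my ($seq) = @_;
    my $pattern = '';
    for my $code ( split //, uc $seq ) {
        die "unsupported symbol: $code" unless exists $IUPAC_CLASS{$code};
        $pattern .= $IUPAC_CLASS{$code};
    }
    return $pattern;
}

my $pattern = degenerate_to_pattern('GRATYC');
print "$pattern\n";    # prints G[AG]AT[CT]C

my $regex = qr/$pattern/i;
print "GAATTC matches\n" if 'GAATTC' =~ $regex;
```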

degenerate_seq_match_sites

Find all sites matching a degenerate subsequence
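
Finding every site, including overlapping ones, can be sketched with a capture inside a lookahead. match_sites is a hypothetical helper, and the [start, end, matched text] layout with 1-based positions is an assumption about a convenient result format, not the documented return value of degenerate_seq_match_sites.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end, matched_text] for every match,
# with 1-based inclusive positions. The lookahead consumes no characters,
# so the search resumes one position later and overlapping sites are found.
sub match_sites {
    my ( $regex, $seq ) = @_;
    my @sites;
    while ( $seq =~ /(?=($regex))/g ) {
        # $-[1] and $+[1] are the 0-based start and end offsets of group 1
        push @sites, [ $-[1] + 1, $+[1], $1 ];
    }
    return \@sites;
}

# A[CT]A is the regular expression for the degenerate sequence AYA.
my $sites = match_sites( qr/A[CT]A/i, 'GATACAT' );
for my $site (@$sites) {
    my ( $start, $end, $text ) = @$site;
    print "$start-$end: $text\n";
}
# prints:
# 2-4: ATA
# 4-6: ACA
```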

dna2peptide

Translate DNA sequence into a peptide

codon2aa

Translate a DNA 3-character codon to an amino acid
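
Codon lookup and translation can be sketched with the standard genetic code. The helper names are hypothetical; reading only the first frame, returning 'X' for unknown codons, and ignoring a trailing incomplete codon are assumptions, and dna2peptide and codon2aa may offer other options.

```perl
use strict;
use warnings;

# Standard genetic code, listed for bases in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
my @CODE_BASES = qw(T C A G);
my $AMINO_ACIDS =
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

my %CODON_TABLE;
my $table_index = 0;
for my $b1 (@CODE_BASES) {
    for my $b2 (@CODE_BASES) {
        for my $b3 (@CODE_BASES) {
            $CODON_TABLE{"$b1$b2$b3"} = substr $AMINO_ACIDS, $table_index++, 1;
        }
    }
}

# Hypothetical helper: one codon to one amino acid ('X' if unknown).
sub translate_codon {
    my ($codon) = @_;
    $codon = uc $codon;
    $codon =~ tr/U/T/;    # accept RNA codons as well
    return exists $CODON_TABLE{$codon} ? $CODON_TABLE{$codon} : 'X';
}

# Hypothetical helper: translate frame 1, ignoring a trailing partial codon.
sub translate_dna {
    my ($dna) = @_;
    my $peptide = '';
    for ( my $pos = 0 ; $pos + 3 <= length($dna) ; $pos += 3 ) {
        $peptide .= translate_codon( substr( $dna, $pos, 3 ) );
    }
    return $peptide;
}

print translate_dna('ATGGCCTAA'), "\n";    # prints MA*
```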

generate_random_seqence

Generate a random sequence of a given length from a given alphabet.

Example:

my @alphabet = qw/a c g t/;
my $seq = generate_random_seqence( \@alphabet, 50 );
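
The idea can be sketched by drawing each position uniformly from the alphabet with Perl's built-in rand. random_sequence is a hypothetical helper, and uniform sampling is an assumption about the intended behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: build a sequence by picking $length random symbols.
sub random_sequence {
    my ( $alphabet, $length ) = @_;
    my $seq = '';
    for ( 1 .. $length ) {
        # rand(N) returns a value in [0, N), so int() gives a valid index
        my $index = int( rand( scalar @$alphabet ) );
        $seq .= $alphabet->[$index];
    }
    return $seq;
}

my @alphabet = qw/a c g t/;
print random_sequence( \@alphabet, 50 ), "\n";
```

Calling srand with a fixed seed beforehand makes the output reproducible, which can be useful in tests.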

shuffle_sequences

Shuffle the sequences in a FASTA file and write the result to a new file.

Example:

shuffle_sequences($file, "$file.shuf.fa");
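
Assuming that shuffling means randomizing the order of the records (not the letters within each sequence), the core step can be sketched with List::Util. shuffle_records is a hypothetical helper that works on in-memory records rather than on files.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: return the records in random order,
# leaving each header and sequence untouched.
sub shuffle_records {
    my ($records) = @_;
    return [ shuffle @$records ];
}

my @records = ( [ 'seq1', 'ACGT' ], [ 'seq2', 'TTGA' ], [ 'seq3', 'GGCC' ] );
for my $record ( @{ shuffle_records( \@records ) } ) {
    print ">$record->[0]\n$record->[1]\n";
}
```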

rename_fasta_header

Rename FASTA headers using a regular expression.

Example:

# delete some symbols
my $n = rename_fasta_header('[^a-z\d\s\-\_\(\)\[\]\|]', '', $file, "$file.rename.fa");
print "$n records renamed\n";
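
The renaming step itself can be sketched as a substitution applied to each header. rename_headers is a hypothetical helper that works on in-memory records, the replacement is treated as a literal string, and counting only the records whose header actually changed is an assumption about what the returned number means.

```perl
use strict;
use warnings;

# Hypothetical helper: apply s/$pattern/$replacement/g to every header
# and return how many headers were changed.
sub rename_headers {
    my ( $pattern, $replacement, $records ) = @_;
    my $renamed = 0;
    for my $record (@$records) {
        # s///g returns the number of substitutions, or false if none
        my $changes = ( $record->[0] =~ s/$pattern/$replacement/g );
        $renamed++ if $changes;
    }
    return $renamed;
}

my @records = ( [ 'seq1 | sample#1', 'ACGT' ], [ 'seq2', 'TTGA' ] );

# delete everything except letters, digits, spaces and a few safe symbols
my $n = rename_headers( '[^A-Za-z\d\s\-_()\[\]|]', '', \@records );
print "$n records renamed\n";    # prints 1 records renamed
print "$records[0][0]\n";        # prints seq1 | sample1
```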

clean_fasta_header

Replace the given symbols with a replacement string, because some symbols in a FASTA header can cause unexpected results.

Example:

my $file = "test.fa";
my $n = clean_fasta_header($file, "$file.rename.fa");
# replace any of the symbols \/:*?"<>| with '', i.e. delete them
# my $n = clean_fasta_header($file, "$file.rename.fa", '',  '\/:*?"<>|');
print "$n records renamed\n";