NAME
BioUtil::Seq - Utilities for sequence
Great modules such as BioPerl provide many robust solutions. However, they are not easy to install on some platforms, and for simple scripts a lightweight module may be a better choice. So I reinvented some wheels and added some useful utilities to this module, hoping it will be helpful.
VERSION
Version 2014.1213
EXPORT
FastaReader
read_sequence_from_fasta_file
write_sequence_to_fasta_file
format_seq
validate_sequence
complement
revcom
base_content
degenerate_seq_to_regexp
degenerate_seq_match_sites
dna2peptide
codon2aa
generate_random_seqence
shuffle_sequences
rename_fasta_header
clean_fasta_header
SYNOPSIS
use BioUtil::Seq;
SUBROUTINES/METHODS
FastaReader
FastaReader is a closure-based fasta file parser. It returns an anonymous subroutine; each call to that subroutine returns the next fasta record as a reference to an array containing the fasta header and the sequence.
FastaReader can also read from STDIN when the file name is "STDIN".
A second, boolean argument is optional. If it is true, carriage return ("\r") and newline ("\n") characters in the sequence will not be trimmed.
Example:
# do not trim the spaces and \n
# $not_trim = 1;
# my $next_seq = FastaReader("test.fa", $not_trim);
# read from STDIN
# my $next_seq = FastaReader('STDIN');
# read from file
my $next_seq = FastaReader("test.fa");
while ( my $fa = &$next_seq() ) {
    my ( $header, $seq ) = @$fa;
    print ">$header\n$seq\n";
}
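The closure idea behind this interface can be illustrated with a minimal, self-contained sketch. It is not the module's implementation: make_fasta_iterator is a hypothetical name, it takes an already opened filehandle instead of a file name, and it always strips line endings.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioUtil::Seq): returns a closure that
# yields one [header, sequence] array reference per call, then undef.
sub make_fasta_iterator {
    my ($fh) = @_;
    my $pending;    # header of the record currently being collected

    return sub {
        my $seq = '';
        while ( my $line = <$fh> ) {
            $line =~ s/\r?\n$//;    # strip the line ending
            if ( $line =~ /^>(.*)/ ) {
                my $next_header = $1;
                if ( defined $pending ) {
                    # a new header closes the previous record
                    my $record = [ $pending, $seq ];
                    $pending = $next_header;
                    return $record;
                }
                $pending = $next_header;
                $seq     = '';      # ignore anything before the first header
            }
            else {
                $seq .= $line;
            }
        }
        # end of input: flush the last record, if any
        if ( defined $pending ) {
            my $record = [ $pending, $seq ];
            $pending = undef;
            return $record;
        }
        return;
    };
}

# Usage with an in-memory file, so the sketch runs without test.fa
my $fasta = ">seq1 demo\nACGT\nTTAA\n>seq2\nGGCC\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
my $iterator = make_fasta_iterator($fh);
while ( my $fa = $iterator->() ) {
    my ( $header, $seq ) = @$fa;
    print ">$header\n$seq\n";
}
```

Because the closure keeps the filehandle and the pending header as private state, only one record is held in memory at a time, which is the main reason to prefer this style for large files.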
read_sequence_from_fasta_file
Read all sequences from fasta file.
Example:
my $seqs = read_sequence_from_fasta_file($file);
for my $header (keys %$seqs) {
    my $seq = $$seqs{$header};
    print ">$header\n$seq\n";
}
write_sequence_to_fasta_file
Example:
my $seq = {"seq1" => "acgagaggag"};
write_sequence_to_fasta_file($seq, "seq.fa");
format_seq
Format sequence to readable text
Example:
my $seq = {"seq1" => "acgagaggag"};
write_sequence_to_fasta_file($seq, "seq.fa");
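Making a sequence readable usually means wrapping it at a fixed line width. The sketch below shows that idea only. wrap_sequence_sketch is a hypothetical helper, and both its argument order and the default width of 70 are assumptions rather than the documented behaviour of format_seq.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioUtil::Seq): break a sequence into
# lines of at most $width characters, each terminated by a newline.
sub wrap_sequence_sketch {
    my ( $seq, $width ) = @_;
    $width = 70 unless defined $width;    # assumed default
    die "width must be a positive integer\n"
        unless $width =~ /^[1-9]\d*$/;

    # the sequence is assumed to contain no newlines already
    my @chunks = $seq =~ /(.{1,$width})/g;
    return join( '', map { "$_\n" } @chunks );
}

print ">seq1\n", wrap_sequence_sketch( "acgagaggag", 4 );
```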
validate_sequence
Validate a sequence.
Legal symbols:
DNA: ACGTRYSWKMBDHV
RNA: ACGURYSWKMBDHV
Protein: ACDEFGHIKLMNPQRSTVWY
gap and space: - *.
Example:
if (validate_sequence($seq)) {
    # do something
}
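A check of this kind can be sketched as a single character-class match built from the symbol lists above. is_legal_sequence_sketch is a hypothetical name; treating the check as case-insensitive and rejecting the empty string are assumptions, not documented behaviour of validate_sequence.

```perl
use strict;
use warnings;

# Union of the DNA, RNA and protein alphabets listed above.
my $legal_letters = join '',
    'ACGTRYSWKMBDHV', 'ACGURYSWKMBDHV', 'ACDEFGHIKLMNPQRSTVWY';

# Letters plus gap and space symbols: '-', whitespace, '*' and '.'
my $legal_re = qr/\A[\Q$legal_letters\E\-\s*.]+\z/i;

# Hypothetical helper (not part of BioUtil::Seq): returns 1 when every
# character of $seq is a legal symbol, 0 otherwise.
sub is_legal_sequence_sketch {
    my ($seq) = @_;
    return $seq =~ $legal_re ? 1 : 0;
}

for my $candidate ( 'ACGU-acgt *.', 'ACGT123' ) {
    printf "%-14s %s\n", $candidate,
        is_legal_sequence_sketch($candidate) ? 'valid' : 'invalid';
}
```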
complement
Complement sequence
IUPAC nucleotide code: ACGTURYSWKMBDHVN
http://droog.gs.washington.edu/parc/images/iupac.html
code    base           Complement
A       A              T
C       C              G
G       G              C
T/U     T              A
R       A/G            Y
Y       C/T            R
S       C/G            S
W       A/T            W
K       G/T            M
M       A/C            K
B       C/G/T          V
D       A/G/T          H
H       A/C/T          D
V       A/C/G          B
X/N     A/C/G/T        X
.       not A/C/G/T
or -    gap
my $comp = complement($seq);
revcom
Reverse complement sequence
my $revcom = revcom($seq);
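The complement table maps naturally onto Perl's tr/// operator, and the reverse complement is the reversed result. The following is a minimal sketch under stated assumptions: the _sketch names are hypothetical, case is preserved, and N, X, gaps and any other character are left unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioUtil::Seq): complement using the
# IUPAC pairs from the table (A<->T, C<->G, U->A, R<->Y, K<->M, B<->V,
# D<->H; S and W are their own complements).
sub complement_sketch {
    my ($seq) = @_;
    ( my $comp = $seq ) =~
        tr/ACGTURYSWKMBDHVacgturyswkmbdhv/TGCAAYRSWMKVHDBtgcaayrswmkvhdb/;
    return $comp;
}

# Hypothetical helper: reverse complement built on the function above.
sub revcom_sketch {
    my ($seq) = @_;
    return scalar reverse complement_sketch($seq);
}

print complement_sketch('ACGTRYKM'), "\n";
print revcom_sketch('AACG'),         "\n";
```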
base_content
Example:
my $gc_content = base_content('gc', $seq);
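One way to compute such a value is to count the characters that belong to the given base set and divide by the sequence length. This is only a sketch: base_fraction_sketch is a hypothetical name, and returning a case-insensitive fraction between 0 and 1 (rather than a count or a percentage) is an assumption about what base_content reports.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioUtil::Seq): fraction of characters
# in $seq that belong to the set of bases given in $bases.
sub base_fraction_sketch {
    my ( $bases, $seq ) = @_;
    return 0 if length($seq) == 0;    # avoid division by zero

    # list assignment in scalar context counts the matches
    my $count = () = $seq =~ /[\Q$bases\E]/gi;
    return $count / length($seq);
}

printf "GC fraction: %.2f\n", base_fraction_sketch( 'gc', 'acgagaggag' );
```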
degenerate_seq_to_regexp
Translate degenerate sequence to regular expression
degenerate_seq_match_sites
Find all sites matching a degenerate subsequence
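Both steps can be sketched together: each IUPAC code becomes a character class, and a lookahead match reports every site, including overlapping ones. The _sketch names, the 0-based positions and the returned data structure are assumptions for illustration, not the module's interface.

```perl
use strict;
use warnings;

# IUPAC codes expanded to regular expression fragments.
my %iupac_class = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',    U => 'U',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Hypothetical helper (not part of BioUtil::Seq): compile a degenerate
# motif into a case-insensitive regular expression.
sub degenerate_to_regexp_sketch {
    my ($motif) = @_;
    my $pattern = '';
    for my $code ( split //, uc($motif) ) {
        die "unknown IUPAC code: $code\n" unless exists $iupac_class{$code};
        $pattern .= $iupac_class{$code};
    }
    return qr/$pattern/i;
}

# Hypothetical helper: list of [0-based start, matched text] pairs.
sub match_sites_sketch {
    my ( $motif, $seq ) = @_;
    my $re = degenerate_to_regexp_sketch($motif);
    my @sites;
    # the lookahead consumes nothing, so overlapping sites are found too
    while ( $seq =~ /(?=($re))/g ) {
        push @sites, [ $-[0], $1 ];
    }
    return \@sites;
}

for my $site ( @{ match_sites_sketch( 'RY', 'ACGT' ) } ) {
    print "start $site->[0]: $site->[1]\n";
}
```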
dna2peptide
Translate DNA sequence into a peptide
codon2aa
Translate a DNA 3-character codon to an amino acid
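Translation can be sketched as a codon table lookup applied to successive triplets. The example below assumes the standard genetic code, reads only the first frame, writes stop codons as "*" and unknown codons as "X", and drops an incomplete trailing codon. These choices and the _sketch names are assumptions, not the documented behaviour of dna2peptide or codon2aa.

```perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in T, C, A, G order.
my @bases       = qw(T C A G);
my $amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $index = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr( $amino_acids, $index++, 1 );
        }
    }
}

# Hypothetical helper (not part of BioUtil::Seq): one codon to one
# amino acid; RNA codons are accepted by mapping U to T.
sub codon_to_aa_sketch {
    my ($codon) = @_;
    ( my $key = uc($codon) ) =~ tr/U/T/;
    return exists $codon_table{$key} ? $codon_table{$key} : 'X';
}

# Hypothetical helper: translate the first reading frame of a sequence.
sub translate_sketch {
    my ($dna) = @_;
    my $peptide = '';
    for ( my $pos = 0 ; $pos + 3 <= length($dna) ; $pos += 3 ) {
        $peptide .= codon_to_aa_sketch( substr( $dna, $pos, 3 ) );
    }
    return $peptide;
}

print translate_sketch('ATGGCCTAA'), "\n";
```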
generate_random_seqence
Example:
my @alphabet = qw/a c g t/;
my $seq = generate_random_seqence( \@alphabet, 50 );
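A random sequence of this kind can be built by drawing one symbol at a time, uniformly, from the alphabet. The sketch below is an assumption about the approach, not the module's code, and random_sequence_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioUtil::Seq): draw $length symbols
# uniformly at random, with replacement, from the array in $alphabet.
sub random_sequence_sketch {
    my ( $alphabet, $length ) = @_;
    die "alphabet must not be empty\n" unless @$alphabet;

    my $seq = '';
    for ( 1 .. $length ) {
        my $pick = int( rand( scalar @$alphabet ) );
        $seq .= $alphabet->[$pick];
    }
    return $seq;
}

my @dna_alphabet = qw/a c g t/;
print random_sequence_sketch( \@dna_alphabet, 50 ), "\n";
```

Perl's built-in rand is adequate for test data; it is not suitable where cryptographic-quality randomness is needed.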
shuffle_sequences
Example:
shuffle_sequences($file, "$file.shuf.fa");
rename_fasta_header
Rename fasta header with regexp.
Example:
# delete some symbols
my $n = rename_fasta_header('[^a-z\d\s\-\_\(\)\[\]\|]', '', $file, "$file.rename.fa");
print "$n records renamed\n";
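The renaming step amounts to a global substitution applied to header lines only. The sketch below works on an array of lines instead of files so that it is self-contained. rename_headers_sketch is a hypothetical name, and applying the pattern case-insensitively is an assumption (suggested by the lowercase-only a-z in the example pattern), not confirmed behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioUtil::Seq): apply s/$regex/$repl/
# to every header line and count how many headers actually changed.
sub rename_headers_sketch {
    my ( $regex, $replacement, $lines ) = @_;
    my $renamed = 0;
    my @out;
    for my $line (@$lines) {
        my $copy = $line;
        if ( $copy =~ /^>(.*)/ ) {
            my $header = $1;
            ( my $new = $header ) =~ s/$regex/$replacement/gi;
            $renamed++ if $new ne $header;
            $copy = ">$new";
        }
        push @out, $copy;    # sequence lines pass through unchanged
    }
    return ( $renamed, \@out );
}

my @lines = ( '>Seq#1 (test)', 'ACGT', '>seq2', 'TTAA' );
my ( $n, $result ) =
    rename_headers_sketch( '[^a-z\d\s\-\_\(\)\[\]\|]', '', \@lines );
print "$_\n" for @$result;
print "$n records renamed\n";
```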
clean_fasta_header
Replace the given symbols with a replacement string, because some symbols in a fasta header can cause unexpected results.
Example:
my $file = "test.fa";
my $n = clean_fasta_header($file, "$file.rename.fa");
# replace any symbol in (\/:*?"<>|) with '', i.e. deleting.
# my $n = clean_fasta_header($file, "$file.rename.fa", '', '\/:*?"<>|');
print "$n records renamed\n";