NAME

FASTASequence - Perl extension for bioinformatics

SYNOPSIS

use FASTASequence;
my $fasta = qq~>sp|P01815|HV2B_HUMAN Ig heavy chain V-II region COR - Homo sapiens (Human).
QVTLRESGPALVKPTQTLTLTCTFSGFSLSSTGMCVGWIRQPPGKGLEWLARIDWDDDKY
YNTSLETRLTISKDTSRNQVVLTMDPVDTATYYCARITVIPAPAGYMDVWGRGTPVTVSS
~;
my $seq = FASTASequence->new($fasta);

DESCRIPTION

This Perl module is a small utility for routine bioinformatics work. It parses a sequence given in FASTA format and extracts information such as:

  • accession number

  • description

  • sequence itself

  • length of sequence

  • CRC64 checksum (as used by SWISS-PROT)

METHODS

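new

my $seq = FASTASequence->new(FASTA_STRING);

creates a new FASTASequence object from a string in FASTA format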

getAccessionNr

my $accession = $seq->getAccessionNr();

returns the accession number of the FASTA sequence

getDescription

my $description = $seq->getDescription();

returns the description from the header (first) line of the FASTA record, without the accession number

getSequence

my $sequence = $seq->getSequence();

returns the sequence

getCrc64

my $crc64_checksum = $seq->getCrc64();

returns the CRC64 checksum of the sequence. This checksum corresponds to the CRC64 checksum used by SWISS-PROT
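For readers curious how such a checksum can be computed, here is a minimal sketch of a table-driven CRC64 over the polynomial x^64 + x^4 + x^3 + x + 1 with a zero initial value, the variant generally associated with SWISS-PROT. It is an illustration under those assumptions, not the module's own code: crc64_sketch and the table names are hypothetical, and the 64-bit value is kept as two 32-bit halves.

```perl
use strict;
use warnings;

# Upper 32 bits of the bit-reflected polynomial x^64 + x^4 + x^3 + x + 1.
use constant POLY_HIGH => 0xd8000000;

# Precompute the 256-entry lookup table, split into high and low halves.
my (@table_high, @table_low);
for my $i (0 .. 255) {
    my ($low, $high) = ($i, 0);
    for (1 .. 8) {
        my $carry = $low & 1;             # bit shifted out of the low half
        $low >>= 1;
        $low |= 0x80000000 if $high & 1;  # move lowest high bit into low half
        $high >>= 1;
        $high ^= POLY_HIGH if $carry;
    }
    $table_high[$i] = $high;
    $table_low[$i]  = $low;
}

# Hypothetical helper: returns the checksum as 16 uppercase hex digits.
sub crc64_sketch {
    my ($text) = @_;
    my ($high, $low) = (0, 0);
    for my $byte (unpack 'C*', $text) {
        my $index = ($low ^ $byte) & 0xFF;
        my $carry = ($high & 0xFF) << 24;  # low byte of high half moves down
        $high = ($high >> 8) ^ $table_high[$index];
        $low  = (($low >> 8) | $carry) ^ $table_low[$index];
    }
    return sprintf '%08X%08X', $high, $low;
}

print crc64_sketch('QVTLRESGPALVKPTQTLTLTCTFSGFSLSS'), "\n";
```

Comparing the sketch's output for a full sequence with the value returned by getCrc64 is a quick way to check whether the assumptions above hold.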

addDBRef

$seq->addDBRef(DB, REFERENCE_AC);

DB is the name of the referenced database

REFERENCE_AC is the accession number in the referenced database

seq2file

$seq->seq2file(FILENAME, OPTIONS);

FILENAME is the path of the file in which the sequence is stored.

OPTIONS is a hash containing the following options:

-o

Overwrite the output file if it already exists. Values: true / false. Default: true

allIndexesOf

my $indexes = $seq->allIndexesOf(EXPR);

returns a reference to an array containing all indexes of EXPR in the sequence
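As an illustration of what collecting all indexes involves, here is a minimal sketch built on Perl's index function. The helper all_indexes_of is hypothetical and not part of FASTASequence. It assumes that EXPR is a literal substring, that indexes are zero-based (the EXAMPLE section adds 1 before printing positions), and that overlapping matches are counted. The module itself may behave differently.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a reference to an array of zero-based
# start positions of $expr in $sequence.
sub all_indexes_of {
    my ($sequence, $expr) = @_;
    my @indexes;
    return \@indexes if $expr eq '';    # avoid an endless loop on empty input

    my $pos = index($sequence, $expr);
    while ($pos != -1) {
        push @indexes, $pos;
        $pos = index($sequence, $expr, $pos + 1);   # continue after this hit
    }
    return \@indexes;
}

my $indexes = all_indexes_of('MCVGWC', 'C');
print join(', ', map { $_ + 1 } @$indexes), "\n";   # prints "2, 6"
```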

getSequenceLength

my $length = $seq->getSequenceLength();

returns the length of the sequence

getDBRefs

my $hashref = $seq->getDBRefs();

returns a hash reference. The hash contains all database references, for example {'SWISS-PROT' => 'P01815'}

getFASTA

my $fasta_sequence = $seq->getFASTA();

returns the sequence in FASTA-format

EXAMPLE

use FASTASequence;
my $fasta = qq~>sp|P01815|HV2B_HUMAN Ig heavy chain V-II region COR - Homo sapiens (Human).
QVTLRESGPALVKPTQTLTLTCTFSGFSLSSTGMCVGWIRQPPGKGLEWLARIDWDDDKY
YNTSLETRLTISKDTSRNQVVLTMDPVDTATYYCARITVIPAPAGYMDVWGRGTPVTVSS
~;

my $seq = FASTASequence->new($fasta);

print 'The sequence of ', $seq->getAccessionNr(), ' is ', $seq->getSequence(), "\n";
my $cysteines = $seq->allIndexesOf('C');   # array reference of zero-based indexes
print 'This sequence contains ', scalar(@$cysteines), ' cysteine residues at the following positions: ';
print join(', ', map { $_ + 1 } @$cysteines), "\n";

ADDITIONAL INFORMATION

accepted formats

This module can parse the following formats:

>P02656 APC3_HUMAN Apolipoprotein C-III precursor (Apo-CIII).
>IPI:IPI00166553|REFSEQ_XP:XP_290586|ENSEMBL:ENSP00000331094|TREMBL:Q8N3H0 T Hypothetical protein
>sp|P01815|HV2B_HUMAN Ig heavy chain V-II region COR - Homo sapiens (Human).
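To show how the three header styles differ, here is a rough sketch of splitting a header line into an accession number and a description. It is not the module's parser: parse_header_sketch is a hypothetical helper. Taking the first DB:ID pair as the accession for the IPI-style header is an assumption, as is keeping everything after the first whitespace as the description.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own parser.
sub parse_header_sketch {
    my ($header) = @_;
    $header =~ s/^>//;                                   # drop the FASTA marker
    my ($id, $description) = split /\s+/, $header, 2;
    $description = '' unless defined $description;

    my $accession;
    if ($id =~ /^[A-Za-z]+\|([^|]+)\|/) {                # sp|P01815|HV2B_HUMAN
        $accession = $1;
    }
    elsif ($id =~ /^[A-Za-z_]+:([^|]+)/) {               # IPI:IPI00166553|REFSEQ_XP:...
        $accession = $1;                                 # assumption: first pair wins
    }
    else {                                               # P02656
        $accession = $id;
    }
    return { accession_nr => $accession, description => $description };
}

my $parsed = parse_header_sketch(
    '>sp|P01815|HV2B_HUMAN Ig heavy chain V-II region COR - Homo sapiens (Human).'
);
print $parsed->{accession_nr}, "\n";    # prints "P01815"
```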

structure

The structure of the hash for the example is:

$VAR1 = {
         'seq_length' => 120,
         'accession_nr' => 'P01815',
         'text' => 'QVTLRESGPALVKPTQTLTLTCTFSGFSLSSTGMCVGWIRQPPGKGLEWLARIDWDDDKYYNTSLETRLTISKDTSRNQVVLTMDPVDTATYYCARITVIPAPAGYMDVWGRGTPVTVSS',
         'crc64' => '158A8B29AE7EEB98',
         'dbrefs' => {},
         'description' => 'Ig heavy chain V-II region COR - Homo sapiens (Human).'
       }

If anything is missing, please contact me.

BUGS

No bugs are known. If you experience any problems, please contact me.

SEE ALSO

http://perl-modules.renee-baecker.de

The CRC64 routine is based on the SWISS::CRC64 module.

AUTHOR

Renee Baecker, <module@renee-baecker.de>

COPYRIGHT AND LICENSE

Copyright 2004 by Renee Baecker

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.