#+TITLE: FAST: Fast Analysis of Sequences Toolbox Cookbook
#+AUTHOR: Katherine C.H. Amrine \& David H. Ardell
#+LANGUAGE:  en
#+TEXT:      These examples are executable from the installation directory.
#+OPTIONS:   H:5 num:t toc:t \n:nil @:t ::t |:t ^:t *:t TeX:t d:nil
#+LATEX_HEADER: \usepackage[colorlinks=true,urlcolor=SteelBlue4,linkcolor=Firebrick4]{hyperref}
#+STARTUP: align

* Tutorials

** Prelude 

*** The FAST definition of "FastA format"

FastA format began with the FastA search utilities of William
Pearson. For FAST, "fasta format" means what is conventionally called
"multi-fasta" format of sequence or alignment data, largely as
implementated in BioPerl in the module =Bio::SeqIO::fasta=. 

In the FAST implementation of "fasta format", multiple sequence
records may appear in a single file or input stream. Sequence data may
contain gap characters. The logical elements of a sequence record are
its /identifier/, /description/ and /sequence/. The /identifier/
(indicated with =id= in the example here) and /description/ (=desc=)
together make the /identifier line/ of a sequence record, that must
begin with the sequence record start symbol =>= on a single line. The
/description/ begins after the first block of white-space on this line
(indicated with =<space>=). The /sequence/ of a record
appears immediately after its identifier line and may continue over
multiple lines until the next record. 

In FAST, the description may be broken into multiple /fields/ defined
by a /delimiter/ (indicated with =<delim>=). FAST uses a "one-based"
indexing of fields as indicated here:

#+BEGIN_EXAMPLE
>seq1-id<space>seq1-desc-field1<delim>seq1-desc-field2<delim>...
seq1-sequence
seq1-sequence
...
seq1-sequence
>seq2-id<space>seq2-desc-field1<delim>seq2-desc-field2<delim>...
seq2-sequence
seq2-sequence
...
seq2-sequence
#+END_EXAMPLE

*** Use =man= pages for full documentation

All FAST utilities follow UNIX conventions in having default and
optional behaviors. For more information about how to use and modify
the behavior of any FAST utility such as =faswc=, consult its manual
page with /e.g./:

#+BEGIN_SRC shell
man faswc
#+END_SRC

or alternatively:

#+BEGIN_SRC shell
perldoc faswc
#+END_SRC


** Example 1: Prototyping a pipeline to cut, reverse complement, and translate a gene by coordinate from a genome

*** Calculating sequence length

Chromosome 1 from the /Saccharomyces cerevisiae/ genome is available
in =t/data/chr01.fsa=. By default, =faslen= calculates the lengths of
sequence records on its input, and outputs its input, augmenting
sequence descriptions with its calculations using the tag (or name)
=length= and a (name,value) separator =:=, as in =length:872=. We can
therefore easily obtain the length of this chromosome sequence as
follows:

#+BEGIN_SRC shell
faslen t/data/chr01.fsa | egrep ">"
#+END_SRC

Alternatively, =faswc -c= will output the length of the chromosome
directly to =STDOUT=:

#+BEGIN_SRC shell
faswc -c t/data/chr01.fsa
#+END_SRC

*** Cut out a subsequence by coordinate with =fascut=

=fascut= will cut a subsequence by coordinate. For example, suppose we
know that the location of gene =YAR030C= in yeast chromosome 1 begins
186512 and ends 186853 on the minus strand. Let's cut this from our
chromosome. The following code will extract this subsequence in fasta
format to =STDOUT=:

#+BEGIN_SRC shell
fascut 186512..186853 t/data/chr01.fsa
#+END_SRC

*** Computing reverse complement of a sequence with =fasrc=

Knowing that this is on the minus strand, we need to obtain the
reverse complement of this sequence.  =fasrc= will compute this.  The
following code will take the output of =fascut= as its input and
return the reverse complemeht in fasta file to =STDOUT=:

#+BEGIN_SRC shell
fascut 186512..186853 t/data/chr01.fsa | fasrc
#+END_SRC

*** Translating a sequence with =fasxl=

To translate this sequence, we extend the pipeline with the =fasxl= utility:

#+BEGIN_SRC shell
fascut 186512..186853 t/data/chr01.fsa | fasrc | fasxl
#+END_SRC

Examine the output, we will see that the peptide starts with a
methionine, and ends with a stop codon, indicated by the =*= character
by default. 

*** Computing codon usage with =fascodon=

If we are interested in the codon usage of our gene, we can edit the
last command-line (by typing =up-arrow= on most UNIX shells) and
replace =fasxl= with =fascodon= at the end of our pipeline. =fascodon=
reprints the input sequence with the counts of each codon with 
information on starts and stops appended to the identifier. With the following
code, we can see that the most frequently used codon in this example
is =AAT= (encoding an Arginine)

#+BEGIN_SRC shell
fascut 186512..186853 t/data/chr01.fsa | fasrc | fascodon
#+END_SRC 

Appending the =-t= option to the command will give the codon frequencies
for both the input sequence and the reverse complement in a verbose
table format as follows 

#+BEGIN_SRC shell
fascut 186512..186853 t/data/chr01.fsa | fasrc | fascodon -t
#+END_SRC


*** Computing base composition with =fascomp=

=fascomp= will return the base/protein composition of a sequence
as an addition to the identifier. If
we are interested in the normalized (option =-n=) base composition of the first
chromosome in a clean table format (option =-t=), we can run the following:

#+BEGIN_SRC shell
fascomp -nt t/data/chr01.fsa
#+END_SRC


** Example 2: Reformatting, selecting and transforming alignments in FAST

*** Reformatting alignment data with =fasconvert=

A file with protein sequences that match a search for "P450" is
available in =t/data/P450.fas= under the FAST installation
directory. Another file contains this data aligned using =clustalw=
with the name =P450.clustalw2.aln=. The =fasconvert= tool can convert
from fasta to many formats, or from many formats to fasta, including
clustalw to fasta as showin in the following example

#+BEGIN_SRC shell
fasconvert -i clustalw -f t/data/P450.clustalw2.aln
#+END_SRC

The previous command automatically saves its output to an output file
of the same basename and an extension of =.fas= in the same
directory of the original file. The
=faswc= utility will count the total number of sequences and 
total number of nucleotides in a fasta file
To look at the length of all sequences, use the
following code.

#+BEGIN_SRC shell
faswc t/data/P450.clustalw2.fas
#+END_SRC

which outputs  

=9      5013 t/data/P450.clustalw2.fas=

=9      5013 total=

to  =STDOUT=.

*** Selecting sequences with =fasgrep=

We can subset the output in many ways to get information we are
interested in, for example, if we want to get the original sequence
with the gi number "86475799", we can use =fasgrep=, which will pull
out sequences that match a Perl regular expression. By default,
=fasgrep= attempts to match sequence identifiers:

#+BEGIN_SRC shell
fasgrep "86475799" t/data/P450.fas
#+END_SRC

We can retrieve the aligned version of this sequence as it has the
same identifier

#+BEGIN_SRC shell
fasgrep "86475799" t/data/P450.clustalw2.fas
#+END_SRC

*** Reformatting gap characters with =fastr=

=fastr= may be useful when we must change specific characters based on
the requirements of a bioinformatic program. For example, to reformat
gap characters in a fasta-format alignment from "-" to ".".

#+BEGIN_SRC shell
fastr -s "-" "." t/data/P450.clustalw2.fas
#+END_SRC 

*** Degapping sites with =alncut= 

=alncut= also allows for editing of alignments based on their gap
profile. This utility is useful in many applications, including 
selecting gap-free sites for input into phylogenetic softwares.
To remove sites with at least one gap in all sequences, we
can do the following:

#+BEGIN_SRC shell
alncut -g t/data/P450.clustalw2.fas
#+END_SRC

We can then determine the length of the alignment by executing:

#+BEGIN_SRC shell
alncut -g t/data/P450.clustalw2.fas | faslen | head -1 
#+END_SRC

And if we are interested in retaining only unique sequences, 
/fasuniq/ appended to the output will collapse duplicate sequences to
one, appending all of the identifiers to one large identifier. 

#+BEGIN_SRC shell
alncut -g t/data/P450.clustalw2.fas | faslen | fasuniq
#+END_SRC

** Example 3: partitioning files based on NCBI taxonomy

The =fastax= tool is a powerful tool when one wants to partition
data based on their taxonomic affiliations. We can partition large
datasets for subset analyses, statistical comparisons, and other
applications and preparation of data. =fastax= depends on the user
supplying a file with the tree structure already defined. In this
example, we will use NCBI taxonomy. The files necessary include
a nodes file (in this case /nodes.dmp/) and a names file (in this
case, /names.dmp/). The nodes file consists of a line for each
taxonomic entry in NCBI with information about its class (superfamily,
genus, etc) and its parent node, indexed by its numeric identifier. 
These files were retrieved from NCBI in a zipped package via FTP 
located at pub/taxonomy/taxdump.tar.gz. The names.dmp file will 
link the numeric identifier to any specific name that the entry can 
be named including its scientific name, common name, and alternative 
spellings accepted by NCBI. We will not create our own nodes and 
names files, but note that it can be done if the user disagrees with 
the NCBI taxonomic structure, or requires more specific partitions 
of their data. 

The importance of structured sequence tags comes into play in this 
example. As described above, the line above the sequence in a fasta
file is indexed by a ">" character followed by the identifier, followed
by a space, and then everything else is located in the description. 
=fastax= will need more structure around the taxonomic classification. 
The program, by default, will search by the description field, but the 
descrption field will have it's own structure. Note, if our description
field only contains the species, or the TaxID, then we don't need
to worry about structuring the description. If it is not the only thing
in our description, then we need to modify the description a bit, or
determine if there is a delimiter that already exists between the 
TaxID and the other components of the description. 

*** Reformat the description to allow for fastax sorting

If we look at our P450.fas file, we see that the description 
consists of "P450" and then a space, and then a square bracket "[",
the species name, and then a closing square bracket "]". There is 
currently no identifier that is unique surrounding only the species
name. Open bracket and closed bracket are two different characters,
and using the open bracket as a delimiter will give you the species
name and the closed bracket in the second field, and this will not 
match correctly. We will first change our description field to one
delimiter using the handy =fastr= tool. Arbitrarily, we will chose 
the double quote character for our description delimiter.

#+BEGIN_SRC shell
fastr -d "[]" "\"" t/data/P450.fas
#+END_SRC

Now in standard output, we will see fasta file-formatted text with
the species name in the description in the sequence tag surrounded
by quotes. In this file, the beginning of the description is in 
field one, and the species in field two. We can use this information
to construct a command to pull out the sequences that are in the 
taxonomic "Pooideae" tribe. Assuming that nodes.dmp and names.dmp are
in the same working directory, we can run the following. 

#+BEGIN_SRC shell
fastr -d "[]" "\"" t/data/P450.fas | 
      fastax -S \" -f 2 t/data/nodes.dmp t/data/names.dmp "Pooideae"
#+END_SRC

The output of this pipeline should be five sequences, including P450 sequences
from the /Triticum aestivum/ and /Lolium rigidum/ species (classified as species
belonging to the /Pooideae/ tribe). 

** Example 4: Retrieve a subset of sequences based on a list of identifiers

=fasgrep= is a useful tool for retrieving subsets of sequences from large
fasta files. Often fasta files will contain an identifier line, and then 
one line following with a sequence corresponding to the previous identifier. 
If this is always the case, parsing fasta files is fairly simple. When 
the sequence that follows the identifier exists on multiple lines, the task
of subsetting sequences becomes more challenging. If we have a list of 
sequence identifiers in the file /ids.txt/, we can write a bash wrapper, 
incorporating the unix =cat= command, for the fasgrep command to search 
for this subset of sequences as follows:

#+BEGIN_SRC shell
for i in $(cat ids.txt); do fasgrep $i sequences.fas; done > subset.fas
#+END_SRC

Now we have a subest of sequences located in /subset.fas/ that correspond
to the identifiers listed in /ids.txt/. 

*** Note on searching for identifiers
Fasgrep works with perl regular expression syntax.It is often beneficial 
to code the identifiers in /ids.txt/ with some sort of a line or a word 
anchor. For example, if you are looking for a complete identifier like
 =sequence_1=
but have sequences in your file named =sequence_11= and =sequence_12=, 

#+BEGIN_SRC shell
fasgrep "sequence_1" sequences.fas
#+END_SRC

will return all three sequences. If
you'd like to just look for =sequence_1=, adding a line anchor, such as:

#+BEGIN_SRC shell
fasgrep "sequence_1$" sequences.fas
#+END_SRC

will return =sequence_1= and skip =sequence_11= and =sequence_12=. This
only works if =sequence_1= is at the end of your identifier. If you are
looking for =sequence_1= within a larger identifier, you may need to 
take extra precautions to not retrieve other names that are more
specific to the structure of your identifiers.