NAME
wurst - wonderful universal remedial structure terminator
SYNOPSIS
perl script_file
DESCRIPTION
use Wurst;
or, depending on the directory you are running from,
use FindBin;
use lib "$FindBin::Bin/../src/Wurst/blib/arch";
use lib "$FindBin::Bin/../src/Wurst/blib/lib";
use Wurst;
INTRODUCTION
This program does sequence to sequence or structure alignments and might do things like rank the consequent models.
The philosophy is that there is a set of low level routines coded in C. These are called from perl. To do anything useful, you write a little perl script which calls the functions you want.
DATA TYPES
We define certain kinds of pieces of data. In general, the interpreter calls a function which returns a pointer to some piece of data with a label. This may be passed into other function calls. When the last reference to the piece of data disappears, the object will be cleared up.
For example, seq_read()
returns something of type SeqPtr. This can then be passed to other functions like make_model()
.
The scheme is such that we have some kind of type checking. For example, make_model has an interface like
make_model ( p, seq, coord)
and the interpreter will only make the call if the first argument is a Pair_set, the second a sequence and the third some coordinates.
The set of data types includes:
- Aa_clssfcn
- Clssfcn
- Coord
- FXParam
- Pair_set
- Param
- Prob_vec
- Sec_s_data
- Seq
- Seq_array
- Sub_mat
- Score_mat
- Scor_set
- Seqprof
PROBABILITY VECTOR NOTES
This is a special section to point one in the right direction to read about the functions which create probability vectors. These are described in the full function list, but they can be hard to find, so they are gathered together here.
- seq_2_prob_vec
-
Create a probability vector from a sequence structure, using a classification that is purely sequence based.
- struct_2_prob_vec
-
Create a probability vector from a structure, using a classification that is purely sequence based.
- aa_strct_2_prob_vec
-
Create a probability vector from a structure, but using a classification based on sequence and structure information. Use both the sequence and structure data.
- aa_2_prob_vec SEQ CLSSFCN
- aa_2_prob_vec SEQ CLSSFCN NORM
-
Create a probability vector from a sequence SEQ, but using a classification CLSSFCN based on sequence and structure information. This is the same classification as above in
aa_strct_2_prob_vec
. The third argument can be used to turn off normalising of vectors by passing a value of 0 (zero). By default, NORM = 1.
FUNCTIONS
- ac_nclass AA_CLSSFCN
-
AA_CLSSFCN is an amino acid fragment classification. Return the number of classes in the classification.
- ac_dump AA_CLSSFCN AA_CLSSFCN is an amino acid fragment classfication. Print some information to stdout about the classication. This is really a placeholder. The information printed is not very comprehensive.
- ac_nclass AA_CLSSFCN AA_CLSSFCN is an amino acid fragment classfication. Return the number of classes.
- ac_size AA_CLSSFCN AA_CLSSFCN is an amino acid fragment classfication. Return the fragment length. The intention is only simple for the case of sequence classes. For structural classifications, there may be more than N descriptors for N residues.
- ac_read FNAME
-
Read an amino acid fragment classification. Return a Aa_clssfcn object if successful, undef otherwise.
- ac_size AA_CLSSFCN AA_CLSSFCN is an amino acid fragment classification. Return the fragment size used in the classification.
- blst_chk_read FNAME
-
Read a BLAST checkpoint file from FNAME. Return a Seqprof object if successful, undef otherwise.
- blst_chk_write FNAME Seqprof
-
Writes a BLAST checkpoint file to FNAME. Returns 1 if successful, and 0 otherwise. This profile is functionally equivalent to profiles written by psi-blast and used by wurst, but not byte-equivalent (due to floating point noise).
- get_clssfcn FNAME abs_error Reads an classification generated by AutoClass-C. Do not forget to give an absolut error value from your original data (it will be used for the integration in computeMembership() ). Returns a Clssfcn object if successful, undef otherwise.
- coord_2_bin COORD FNAME
-
Take a Coord structure and write it, in our proprietary .bin format, to FNAME. Returns 1 on success and undef on failure.
This function may be used when writing the results of a calculation, but more likely, it is used when converting PDB files to our
.bin
format. Look at pdb_read. - coord_c_n_dist COORD i, j, sqrt_flag
-
Given a Coord structure, COORD and the index of two residues, i and j return the distance, in Angstroms between the carbonyl oxygen of i and the backbone nitrogen atom of j if sqrt_flag is set to non-zero. If the flag is left as zero or undef, then the value returned will be the distance squared.
In a protein, the bondlength is about 1.32 Angstrom. Many scripts just want to check for continuous chains, so they do not need the real distance and it is faster to avoid the square root. For example, to check if the third and fourth residues are bonded, you might say
my $dist2 = 1.5 * 1.5; if (coord_c_n_dist ($coord, 2, 3, 0) > $dist2) { print "not connected\n"; } else { print "connected\n"; }
- coord_deletion COORD Start Coord_Length Seq_Length
-
Returns a pointer to a new Coord (structure) where an affine gap has been introduced, arising from the deletion of some sequence or coordinate data from an existing structure. The gap is specified by its Start [0,size), and number of residues excised from the structure (Coord_Length) and sequence (Seq_Length).
Minimal checking ensures gap specifications are sensible. This function is provided for the generation of synthetic threading results and other forms of improper structure sets.
- coord_geo_gap COORD SCALE MAX
-
Calculate a penalty based on the geometric damage in a structure. This is based on the carbonyl carbon to amide nitrogen distance between each pair of residues which should be connected. The ideal C..N distance is 1.32 Angstrom. No penalty is enforced if the distance is less than 2.0 Angstrom. COORD is a set of coordinates. SCALE is the number with which penalties will be multiplied. MAX is a limit on the distance to be considered. For each gap, if it is bigger than MAX, it is set to MAX.
This function returns a list of values like:
($quadratic, $linear, $logistic, $num_gap) = coord_geo_gap ($coord, $scale, $max);
Where
- $quadratic
-
Sum over (distance - ideal)^2 for each gap, where ideal is hard coded to about 1.32 Angstrom.
- $linear
-
This is the sum over (distance - ideal) for each gap.
- $logistic
-
This uses a logistic activation function. The penalty is the sum of a fancy logistic activation function applied to each gap.
- $num_gap
-
This is the number of gaps, regardless of length.
- coord_get_seq COORD
-
Returns a pointer to a Seq (sequence) given a pointer to a Coord.
- coord_has_sec_s COORD
-
Returns non-zero if COORD has defined secondary structure.
- coord_get_sec_s COORD
-
Returns a string containing the secondary structure assignment for the coordinates, if they exist. Side-effect: a non-serious warning message is generated if COORD has no assigned secondary structure.
- coord_name COORD
-
Print the PDB acquisition code and chain identifier from the coordinates, COORD (an object of
Coord
type). - coord_read FNAME
-
Read coordinates from file FNAME. Returns a Coord pointer.
- coord_rmsd PAIR_SET, COORD1, COORD2, SUBSET_FLAG
- coord_rmsd PAIR_SET, COORD1, COORD2
-
Superimpose COORD1 onto COORD2 and calculate the root mean square difference of coordinates in Angstroms, based on an alignment. Return a list with shifted coordinates and the rmsd value like this
($rmsd, $new_c1, $new_c2) = rmsd ($pair_set, $c1, $c2)
or
($rmsd, $new_c1, $new_c2) = rmsd ($pair_set, $c1, $c2, SUBSET_FLAG)
$rmsd is the root mean square difference. $new_c1 is a coordinate object which has been moved. It may be smaller than $c1 (see note below about SUBSET_FLAG). $new_c2 may be identical to $c2.
$pair_set has come from some kind of alignment such as sequence to sequence or structure to structure.
$c1
are the coordinates to be moved.$c2
are the reference coordinates. If defined, the optional argument, SUBSET_FLAG, means that one walks down the list coordinates of$c1
and$c2
and copies into$new_c1
and$new_c2
those sites which were aligned.If the alignment of two proteins includes most of the original coordinates, then one will not usually want to define SUBSET_FLAG. If the two proteins are very different sizes, it makes sense to define this flag (and only see the similar regions). If SUBSET_FLAG is not set, $new_c2 is an exact copy of $c2.
- coord_size COORD
-
Return an integer with the number of residues in COORD, a
Coord
object. - coord_2_pdb FNAME COORD SEQ
- coord_2_pdb FNAME COORD
-
Write the coordinates from COORD (a Coord pointer) to the filename given by FNAME. Returns nonzero on success and the undefined value otherwise.
If the optional last argument, SEQ, is present, it should be a sequence object and will be used to print the original (hopefully) complete sequence in the pdb file in standard, PDB, SEQRES format records.
- coord_2_spdb FNAME COORD SCOR_SET
- coord_2_spdb FNAME COORD SCOR_SET SEQ
-
Identical to coord_2_pdb (above), this function additionally writes a temperature value in the generated PDB file, corresponding to the score given in SCOR_SET (a Scor_set pointer) for each residue in COORD.
See the entry for the function scor_set_simpl (below) for an example.
- coord_2_pnlty COORD VALUE
-
COORD is a
Coord
type object. VALUE is a floating point number. Returns a FloatPtr (floating point array) which will be used for extra gap penalty weights during alignments. At each site in COORD, see if it is in secondary structure. If so, it gets an extra weight, given by VALUE. The exact method is more complicated.If we have a structure which looks like
1 2 3 4 5 6 7 - E E E H - -
The we say that "E" and "H" refer to extended (beta strand) and helix, so residues 2 - 5 look as if they are in interesting secondary structure. In fact, the situation is more interesting. Residue 1 is adjoining a piece of secondary structure, and residue 2 is on the edge of one. Only residues 3 - 4 are really definitely inside secondary structure and residue 7 is definitely outside one. If VALUE is set to 10, then the coefficients will be
residue coefficient 1 3.25 2 7.75 3 10 4 10 5 7.75 6 3.25 7 1
- dmat_b_cliques MODEL REFERENCE BOUND SIZE
-
Returns a list of intervals on MODEL's coordinates corresponding to 'flat' areas of the difference of distance matrices between MODEL and REFERENCE (which could be the native fold of the sequence modelled by MODEL).
The list should be interpreted as ranges in the residues of MODEL :
start1, end1, start2, end2, ...
These define contiguous stretches of MODEL's structure whose distance contacts differ by less than BOUND angstroms from those in REFERENCE. Each interval is at least of size SIZE.
There are some default parameters :
my @good_regions_of_model = dmat_b_cliques $model, $native;
Returns the flat regions longer than 9 residues where DME is less than 1 angstrom.
- dme_thresh FRACTION COORD1 COORD2 THRESHOLD
-
Compare the distance matrices from COORD1 and COORD2. Calculate the fraction of the distance matrix left after setting a threshold of THRESHOLD. Put this fraction into FRACTION.
Returns the useful answer in FRACTION, but returns undef on error.
DME stands for "distance matrix error", a name coined in Havel, TF, Biopolymers, 29, 1565-1585 (1990), but it is really the root mean square difference of distance matrices. To compare structures we calculate the DME based on alpha carbon coordinates. We then go back to the distance original distance matrices and find the most different matrix elements and enter a loop:
while (DME > threshold) remove most different element from calculation return the fraction of the distance matrix remaining
If the two sets of coordinates were very similar, then we did not have to remove any matrix elements, so we return a value near 1.0. If the coordinates were very different, we had to throw away most matrix elements, so we return a number closer to 0.0.
A good value for THRESHOLD is about 3.5 or 4.0 Angstrom. This measure of similarity is bounded by 0 and 1.0 and is not too sensitive to protein size.
- dme_nice MODEL REF BOUND SIZE CORE CORE_BORDER
-
Returns a list of intervals defining bounded regions of the difference of distance matrices between MODEL and REF. Whereas in dme_b_cliques, the intervals define bounded regions, the 'nice' regions defined by this function allow for some deviations at the boundary of the flat regions of a DME plot. SIZE refers to the structural fragment size, and CORE refers to the minimum size of a bounded region expressed as the number of overlapping fragments. CORE_BORDER is the size of the boundary region where deviation is allowed. Essentially this means that pairs of intervals that would be returned by dmat_b_cliques will be merged if they are separated by fewer than CORE_BORDER residues.
- dssp COORD
-
Run the DSSP program on the coordinates in COORD. Store the answer in COORD.
- make_model PAIR_SET SEQ COORD
-
Given an alignment in PAIR_SET (a Pair_set pointer) which comes from sequence SEQ aligned to coordinates COORD, return a Coord pointer with a model of the sequence on the coordinates.
- model_pdb_num COORD MODELNUM
-
Given a COORD OBJECT and a residue position, returns the original PDB number stored for that residue.
- model_res_num COORD RESIDUENUM
-
Given a COORD OBJECT and a residue number (according to PDB sequence numbering), returns the index for that residue in the coord object, or undefined if there is no coordinates for that sequence number.
- num_seq SEQ_ARRAY
-
Returns the number of sequences in a sequence array.
- pair_set_coverage PAIR_SET SIZE1 SIZE2
-
Given an alignment, we often want to know what part of the sequence or structure is accounted for. For example
A B C - - D E - F G H Q R S T U V W - Y Z
In the first string, you could imagine the residues we know about are like
0 0 1 1 1 1 0 1
while in the second string, the pattern is like
1 0 0 1 1 0 1 1
Call this function with a pair set and the size of the first and second objects (sequences or sequence/structure pair). pair_set_coverage returns two lists. One for each member of the alignment respectively.
This returns a pair of strings. In each string, a '0' char means the site is not covered by the alignment, but a '1' means it is. On failure, returns undef. A typical use would be
my ($s1, $s2) = pair_set_coverage ($pair_set, seq_size ($seq), coord_size ($coord); print "Pattern of sequence coverage looks like\n$s1\n"; print "Pattern of struct coverage looks like \n$s2\n";
and the output strings might look like
00100111
Which would mean that the 3rd, 6th and 7th positions were properly aligned. Note also that this leaves room for some tricks. Strings can be binary &'d to look for overlap and '1's can be quickly counted using perl matching operators.
- pair_set_extend PAIR_SET $EXT_LONG
- pair_set_extend PAIR_SET $EXT_SHORT
-
If you have done a Smith and Waterman style alignment, you may not have all residues aligned. Consider
A B C D Q R S E X Y Q R S X Y
The alignment from a highest scoring segment is
Q R S Q R S
and this is what we will likely get if we don't ask otherwise. but we could reasonably do a short extension to include residues which go out to the ends of the shorter sequence.
C D Q R S E X Y Q R S X
One may also want a longer extension, which would look like
A B C D Q R S E - - - X Y Q R S X Y
Given these definitions of extension, (long and short), you may want either result.
The symbols,
$EXT_LONG
and$EXT_SHORT
are constants. You should choose one of them. - pair_set_gap PAIR_SET SCALE_OPEN SCALE_WIDEN
-
Like
coord_geo_gap
, this is for calculating extra gap penalties after doing an alignment. This returns a gap penalty for the sequence, not the structure. It returns two values likemy ($open_penalty, $widen_penalty); ($open_penalty, $widen_penalty) = pair_set_gap($pair_set, $o_scl, $w_scl);
For each gap, the penalty is calculated as
penalty = open_scale + (n - 1) * widen_scale
where
n
is the length of the gap. Because these things are linear, you can get the number of gaps in a whole alignment fromn_gap = $open_penalty / $open_scale
. - pair_set_score PAIR_SET
-
This returns two floating point numbers. The first is the score from an alignment, including gaps. The second is the score, not including gaps. The intentions are to
Let one look at the pure sequence-structure score and
Let one use a more sophisticated gap scoring scheme than the one used in the alignment calculation.
For example, we have have something like
my $pair_set = score_mat_sum_sec (.....); my ($score, $simple_score) = pair_set_score ($pair_set); print "Score with gaps is ", $score, "without gaps is ", $simple_score, "\n";
For compatibility with old scripts, it is still OK to say
my $score = pair_set_score ($pair_set)
- pair_set_string PAIR_SET SEQ1 SEQ2
-
Return a string for printing. PAIR_SET is an alignment corresponding to SEQ1 and SEQ2. If this came from a sequence to structure alignment, we still want the sequence corresponding to SEQ2. This is easy to get from
coord_get_seq (C)
but this routine should be renamed
seq_from_coord()
. - pair_set_pretty_string PAIR_SET SEQ1 SEQ2 SEC_S_DATA COORD2
- pair_set_pretty_string PAIR_SET SEQ1 SEQ2
-
Return a string with a pretty alignment.
If you have secondary structure data (SEC_S) which has probably come from a predictor and if you provide the coordinates (COORD2), then the output will include this secondary structure information.
- param_fx_read FNAME
-
Read parameters for Thomas' Bayesian scoring / alignment functions. Returns an FXParam object on success, or undefined on failure.
- param_rs_read FNAME
-
Read parameters for the tanh() based score function with all atoms. This is for rescoring, not alignment. Returns an RSParam object on success, or undefined on failure.
- pdb_read FNAME ACQ_CODE CHAIN
-
Read the pdb file, FNAME, and return a Coord object. Note that this will not include secondary structure information. ACQ_CODE is the acquisition code. It can be blank, in which case, the program will try to guess the code from the filename. CHAIN can also be blank, or a single letter, chain specifier.
This function is for rewriting lots of files, for example, when we rebuild a library. This means it tolerates errors rather than stopping. Furthermore, it contains a list of residue name substitutions. For example, it will change
original new_name comment --------------------------- UNK ALA Unknown residues become alanine MSE MET Selenomethionine becomes methionine CSE CYS Selenocysteine becomes cystein
There are currently more than 20 of these conversions. They are a one way affair. After reading the coordinates, the original name is thrown away.
- prob_vec_info PROB_VEC
-
PROB_VEC is a matrix of class probabilities. This returns a string containing some minimal information. Functions like "seq_2_prob_vec" return a probability vector.
- scor_set_simpl PAIR_SET SCORE_MAT
-
Returns a Scor_set object, corresponding to the local score for each aligned pair of residues in PAIR_SET, based on the scores in SCORE_MAT. These local scores, when summed, yield the 'simple score' returned by pair_set_score.
The obvious use of this function is to generate an annotated PDB file :
coord_2_spdb ("1mypC.pdb", make_model($pair_set, $seq, $template) , scor_set_simpl($pair_set, $tot_score_mat));
Where the temperature entry of the each residue in the model will be set to the local score given in the score matrix.
- scor_set_fromvec @SCORES
-
Returns a Scor_set object containing the vector of scores given in @SCORES.
- scor_set_to_vec $scor_set
-
Returns a vector of doubles corresponding to the scores contained in the Scor_set object. This is useful when you want to actually print out the scores for an alignment :
# float_to_character maps a signed float to a # scale of alphanumerics sub d2c( $ ) { return float_to_character ( $_ ) }; my $i=0; my $cs = pair_set_coverage( $pairset, $seq_len, $coord_len ); my $scor = scor_set_to_vec( $scor_set); # from $pair_set $cs=~ s/0/./g; $cs=~ s/1/d2c($$scor[$i++]);/ge; # $cs is now a coverage string where every aligned position is # scored according to float_to_character.
- scor_set_scale SCOR_SET SCALE
-
Returns true if it managed to divide each entry in the Scor_set by the scalar in SCALE.
- score_fx MATRIX SEQ COORD PARAM
-
Fill out a score matrix for a sequence structure pair, based on Herr Doktor Huber's Bayesian-based score function. This score function works for sequence to structure alignments.
MATRIX is a new created score matrix, probably returned from score_mat_new. SEQ is a sequence object (SeqPtr) COORD is a coordinate object (CoordPtr). PARAM are parameters from
param_fx_read
. They have type FXParam and this type checking is enforced by the interpreter. - score_fx_prof MATRIX SP COORD PARAM
-
This is the same as score_fx, but the second argument,
SP
, is a sequence profile, rather than a sequence. - score_mat_add (MAT1, MAT2, SCALE, SHIFT)
-
Return a new score matrix where each element is given by
NEW_MAT = MAT1 + ( SCALE * MAT2 + SHIFT)
If only three arguments are given, SHIFT is set to zero.
- score_mat_info MAT
-
Given a Score_mat object in MAT, return a list of the form (MIN, MAX, AV, STD_DEV) Ignore the first and last row and column, since they are used for treatment of end gaps.
Note, the calling procedure for this function changed in May 2006, so old scripts may break;
- score_mat_new N_ROWS N_COLS
-
Create a new score matrix and return a Score_mat object. In a sequence to structure alignment, N_ROWS is the size of the sequence, N_COLS is the size of the structure. So, if
$seq
is a sequence and$coord
is a structure, we might have$scr_mat = score_mat_new (seq_size ($seq), coord_size ($coord));
You are entitled to assume that the new matrix is properly zeroed. Code should not bother to zero it again.
- score_mat_scale MAT, SCALE
-
Multiply all elements in MAT by SCALE. This returns a new matrix.
- score_mat_shift (MAT1, SHIFT)
-
Return a new score matrix where each element is given by
NEW_MAT = MAT1 + SHIFT
The first and last row and column are not shifted. This is one way to implement end gaps. This function returns a new matrix.
- score_pvec MATRIX, PVEC1, PVEC2
-
Fill out the score matrix, MATRIX based on comparing the class membership probability vectors PVEC1 and PVEC2. These probability vectors check the length of the original fragments and will return undef if they do not match.
- score_mat_read FNAME
-
Go to FNAME and read the score matrix that was probably written by score_mat_write. Return a new score matrix or undef on failure like
my $scr_mat = score_mat_read ($fname); if (! $scr_mat) { print STDERR "Reading score matrix from $fname failed.\n"; }
- score_mat_string SCORE_MAT SEQ1 SEQ2
-
This is mainly for debugging. After calling a score function, this will return a string containing the score matrix with the sequence at the top and left hand side. If you did a sequence to structure alignment, you should still call it with the sequences.
- score_mat_sum_full RMAT, SCORE_MAT, PGAP_OPEN, PGAP_WIDEN, QGAP_OPEN, QGAP_WIDEN,P_MULT, Q_MULT, ALIGN_TYPE
- score_mat_sum_full RMAT, SCORE_MAT, PGAP_OPEN, PGAP_WIDEN, QGAP_OPEN, QGAP_WIDEN,P_MULT, Q_MULT, ALIGN_TYPE, BIAS_SET
-
This does the summation of a score matrix. Returns non-zero on success and zero on failure. Note, there is an optional last argument. The parameters are
- RMAT
-
This is used for returning the summed matrix. The function does not overwrite SCORE_MAT. Instead it returns a new matrix. This can generally be ignored.
- SCORE_MAT
-
This is the score matrix which came from a call to something like score_smat.
- PGAP_OPEN
-
Penalty for opening gaps in the first of the pair of sequences or structures or whatever.
- PGAP_WIDEN
-
Penalty for extending the PGAPs.
- QGAP_OPEN
-
Penalty for opening gaps in the second of the pair.
- QGAP_WIDEN
-
Penalty for extending the QGAPs
- P_MULT
-
This is an object of
FloatPtr
type with site specific coefficients for gap penalties and will be applied to the P_GAP. Typically, this will be for extra penalties for secondary structure gaps. - Q_MULT
-
As for P_MULT, but for the q_gaps.
- ALIGN_TYPE
-
Either "$N_AND_W" or "$S_AND_W" for Needleman and Wunsch or Smith and Waterman respectively.
- BIAS_SET
-
This is a pair_set which may have come from a previous alignment. It really is an object of Pair_setPtr type. It is optional. If present, the alignment will be coerced to follow this alignment. The use is that one can do an Smith and Waterman alignment and get the best scoring segment. One can then set the ALIGN_TYPE switch to
$N_AND_W
and do a globally optimal alignment, but passing through the optimal segment.
- score_mat_write SCORE_MAT FNAME
-
Write the score matrix SCORE_MAT to the file name given by FNAME. Return undef on failure.
- score_rs COORD PARAMS
-
This will apply the tanh() based score function which is primarily for rescoring. It is NOT neighbour-non-specific or unusual in any other way.
COORD is a Coord object. PARAMS is a set of parameters, formally of type RSParams and probably resulting from a call to param_rs_read.
Returns a floating point number.
- score_smat SCORE_MAT, SEQ1, SEQ2, SUB_MAT
-
Fill out a score matrix based on sequence comparison of two sequences. SCORE_MAT is a score matrix which may have come from new_matrix. SEQ1 and SEQ2 are the sequences. SUB_MAT is a substitution matrix.
- score_sprof SCORE_MAT, PROF, SEQ, SUB_MAT
-
This is very similar to score_smat. SCORE_MAT is a score matrix, probably from new_matrix. PROF is a sequence profile, maybe from blst_chk_read. SEQ is a sequence and SUB_MAT is a substition matrix. This fills out a score matrix using the substition matrix it has been given and fractional sequences from the sequence profile.
- score_sec_s SCORE_MAT, SEC_S_DATA, COORD
-
Given a SCORE_MAT which probably came from
score_mat_new
, and some secondary structure data in SEC_S_DATA, probably fromsec_s_data_read
and a set of coordinates in COORD, do a scoring.We compare each site against the predicted secondary structure. If the psi angle difference is more than 90 degrees, we set it to 90. We then return cos (difference). Note, this is different behaviour to earlier versions. If an angle is very different (diff > 90), it it not penalised. If the angle is less than 90, a value will be returned between 0 and 1.0
- sec_s_data_read FNAME
-
Read secondary structure data from FNAME. This may be in PHD/Rost format or our "item_manual_secondary_structure_file" in "manual input format" format defined below. Return a pointer to an object of Sec_s_data type.
- sec_s_data_string SEC_S
-
SEC_S is a pointer to a Sec_s_data object. Return a string with the bare secondary structure information. For example:
$x = sec_s_data_read ("sec_data_file") || die "Fail on $tmp_data: $!"; print sec_s_data_string($x);
- seq_2_prob_vec SEQ, AA_CLSSFCN Given a sequence, SEQ and an amino acid classification, AA_CLSSFCN return an object of PROB_VEC type. This is most interesting for feeding to a score function like score_pvec.
- seq_read FNAME
-
Read a sequence from file FNAME. Returns a SeqPtr object.
- seq_read_many FNAME
-
Reads a number of sequences from file FNAME. Returns a SeqArrayPtr object (array of sequences).
- seq_from_string STRING
-
Returns a sequence object
SeqPtr
from a string. This lets one build a sequence in the interpreter like$seq = seq_from_string ('avlc');
Would store a sequence (Ala, Val, Leu, Cys) in the $seq variable. The string can have a FASTA style comment at the start. The string can span multiple lines.
- seq_get_1 SEQ_ARRAY N
-
Return the N'th string from a sequence array, SEQ_ARRAY. Returns a SeqPtr object.
- seq_num SEQ_ARRAY
-
Returns the number of sequences stored in the sequence array, SEQ_ARRAY.
- seq_print SEQ
-
Return a sequence to the interpreter as a string. It is not pretty and will be fixed.
The sequence will be beautified so it comes out in blocks of ten residues, probably with numbering.
- seq_print_many SEQ_ARRAY
-
Returns a string containing all the sequences in SEQ_ARRAY.
- seq_size SEQ
-
Return the number of residues in a sequence, SEQ.
This is the number of residues in a sequence. There are no complications or frills like null terminators.
- seqprof_get_seq SEQPROF
-
This is analogous to coord_get_seq. It takes a sequence profile and returns a Seq sequence.
- seqprof_str SEQPROF
-
Take the sequence profile in the SEQPROF object and return a printable string. For example
if ( ! ($profile = blst_chk_read ( $fname))) { return undef; } print seqprof_str ($profile);
- struct_2_prob_vec COORD CLSSFCN Computes the memberships matrix (probability vector) for a structure COORD given a CLSSFCN
- sub_mat_get_by_i SUB_MAT, M, N
-
Return the substition matrix element indexed by M and N
- sub_mat_get_by_c SUB_MAT, A, B
-
Return the substitution matrix element for the amino acids A and B. So
$a = sub_mat_get_by_c ($sub_mat, 'c', 'd');
would return the current value for cys/asp. The amino acid names are single letter codes and may be upper or lower case. - sub_mat_set_by_c SUB_MAT, A, B, val
-
Set the substitution matrix element for A and B to val.
- sub_mat_set_by_i SUB_MAT, M, N, val
-
Set the value indexed by M,N in the substition matrix to val.
- sub_mat_string SUB_MAT
-
Return a string containing the substitution / score matrix held in SUB_MAT.
- sub_mat_read (FILENAME)
-
Go to FILENAME. Read up the substitution / score matrix and return it.
- sub_mat_shift SUBST_MATRIX, BOTTOM
-
Given a substitution matrix (an object of type Sub_mat), shift the whole matrix so the smallest (most negative value) is of size BOTTOM. This does not return anything. It acts on the SUBST_MATRIX argument directly.
- sub_mat_scale SUBST_MATRIX, BOTTOM, TOP
-
Given a substition matrix, SUBST_MATRIX, scale and shift it so the minimum and maximum values run from BOTTOM to TOP.
- score_mat_sum_smpl NEW_MAT SCORE_MAT PGAP_OPEN PGAP_WIDEN QGAP_OPEN QGAP_WIDEN ALIGNMENT_TYPE
-
We have a score matrix which could be from sequence/sequence, sequence/structure or whatever. Now, do the dynamic programming work. Sum the score matrix and return a set of pairs.
The parameters are
- NEW_MAT
-
This is a fresh matrix with the traced back scores in it.
- SCORE_MAT
-
This is the score matrix.
- GAP_OPEN
- GAP_WIDEN
- ALIGNMENT_TYPE
-
There are only two values allowed, either
$N_AND_W
or
$S_AND_W
These stand for "Needleman and Wunsch" and "Smith and Waterman" respectively. Any other value will cause an error.
- svm_rs_cdata MODEL NATIVE SCOR_SET RS_PARAM CVTYPE
-
*EXPERIMENTAL!*
The function returns an array of training vectors suitable for use in training a support vector machine (libSVM.pm) or some other machine learning procedure. Its form is :
[ [label_class, [(feature vector)], .. ]
The scheme for calculating the training vectory is given in CVTYPE, and the data is formed from the local sequence to structure scores as given by SCOR_SET, the TANH forcefield based pairwise interaction terms (calculated via RS_PARAM), and the local model consistency (based on the difference of distance matrices computed between MODEL and NATIVE).
Scheme 0 works as follows : (see scoranlys.c:get_svmdata for details at the moment).
- svm_rsfeat MODEL SCOR_SET RS_PARAMS CVTYPE
-
This returns a set of feature vectors for each position in MODEL, calculated from local sequence-structure fitness and residue-specific interaction terms according to the CVTYPE scheme (see svm_rs_cdata or scoranlys.c for details). The form is as follows :
my @m_fvset = svm_rsfeat MODEL, SCOR_SET, PARAMS, 0 @m_fvset is of form [ (undef), [feature vector], .., .., (undef)] and (scalar @m_fvset) == coord_size(MODEL)
undefs are given for positions in the model where a full feature vector cannot be computed (at the ends, for instance).
BUILD AND INSTALL
Generally, nothing except the top level Makefile should be changed. In the top level directory, edit the Makefile, then type
make
If this looks OK, you might
cd scripts
perl hello.pl
This will check if Wurst pieces appear to be in place. If that looks OK, then edit wurst/src/Wurst/Makefile.PL to set the installation destination. Then
make install
from the top level directory. Then go back to the scripts directory and try a different file like
cd scripts
perl hello2.pl
This will attempt to run a test using something like "lib" as the destination directory.
FILE FORMATS
- PHD secondary structure files
-
Wurst can read secondary structures from the PHD prediction server. The format is empirically defined. That means we try to read anything that comes from the server.
- manual secondary structure file
-
One may type in secondary structure predictions or assignments. The format is like:
# speculative secondary structure assignments secondary structure 5 - 30 h 37 e 40-45 e 5 # This is a very unreliable guess
This means that residues 5 to 30 are helix. Residue 37 is sheet (extended). Residues 40 to 45 are sheet (extended) and they have a confidence level of 5.
The first non blank line must say "secondary structure". This is not optional. It is used to recognise the file format.
Confidence levels can be given from 0 to 9. Zero means there is no confidence. 9 means you are very confident. This number must be an integer. There is no way to give a more detailed number.
Confidence levels are optional. The default is confidence=9.
Anything after a hash
#
is a comment.Blank lines are ignored.
AUTHORS
Alphabetically... Steve Hoffmann, Thomas Huber, Tina Lai, Nasir Mahmood, Thomas Margraf, Martin Mosisch, James B. Procter, Gundolf Schenk, Andrew E. Torda.
SEE ALSO
coding.pod contains rules for adding to wurst.