NAME

wurst - wonderful universal remedial structure terminator

SYNOPSIS

perl script_file

DESCRIPTION

use Wurst;

or, depending on the directory you are running from,

use FindBin;
use lib "$FindBin::Bin/../src/Wurst/blib/arch";
use lib "$FindBin::Bin/../src/Wurst/blib/lib";
use Wurst;

INTRODUCTION

This program does sequence to sequence or structure alignments and might do things like rank the consequent models.

The philosophy is that there is a set of low level routines coded in C. These are called from perl. To do anything useful, you write a little perl script which calls the functions you want.

DATA TYPES

We define certain kinds of pieces of data. In general, the interpreter calls a function which returns a pointer to some piece of data with a label. This may be passed into other function calls. When the last reference to the piece of data disappears, the object will be cleared up.

For example, seq_read() returns something of type SeqPtr. This can then be passed to other functions like make_model().

The scheme is such that we have some kind of type checking. For example, make_model has an interface like

make_model ( p, seq, coord)

and the interpreter will only make the call if the first argument is a Pair_set, the second a sequence and the third some coordinates.

The set of data types includes:

Aa_clssfcn
Clssfcn
Coord
FXParam
Pair_set
Param
Prob_vec
Sec_s_data
Seq
Seq_array
Sub_mat
Score_mat
Scor_set
Seqprof

PROBABILITY VECTOR NOTES

This is a special section to point one in the right direction to read about the functions which create probability vectors. These are described in the full function list, but they can be hard to find, so they are gathered together here.

seq_2_prob_vec

Create a probability vector from a sequence structure, using a classification that is purely sequence based.

struct_2_prob_vec

Create a probability vector from a structure, using a classification that is purely sequence based.

aa_strct_2_prob_vec

Create a probability vector from a structure, but using a classification based on sequence and structure information. Use both the sequence and structure data.

aa_2_prob_vec SEQ CLSSFCN
aa_2_prob_vec SEQ CLSSFCN NORM

Create a probability vector from a sequence SEQ, but using a classification CLSSFCN based on sequence and structure information. This is the same classification as above in aa_strct_2_prob_vec. The third argument can be used to turn off normalising of vectors by passing a value of 0 (zero). By default, NORM = 1.

FUNCTIONS

ac_nclass AA_CLSSFCN

AA_CLSSFCN is an amino acid fragment classification. Return the number of classes in the classification.

ac_dump AA_CLSSFCN AA_CLSSFCN is an amino acid fragment classfication. Print some information to stdout about the classication. This is really a placeholder. The information printed is not very comprehensive.
ac_nclass AA_CLSSFCN AA_CLSSFCN is an amino acid fragment classfication. Return the number of classes.
ac_size AA_CLSSFCN AA_CLSSFCN is an amino acid fragment classfication. Return the fragment length. The intention is only simple for the case of sequence classes. For structural classifications, there may be more than N descriptors for N residues.
ac_read FNAME

Read an amino acid fragment classification. Return a Aa_clssfcn object if successful, undef otherwise.

ac_size AA_CLSSFCN AA_CLSSFCN is an amino acid fragment classification. Return the fragment size used in the classification.
blst_chk_read FNAME

Read a BLAST checkpoint file from FNAME. Return a Seqprof object if successful, undef otherwise.

blst_chk_write FNAME Seqprof

Writes a BLAST checkpoint file to FNAME. Returns 1 if successful, and 0 otherwise. This profile is functionally equivalent to profiles written by psi-blast and used by wurst, but not byte-equivalent (due to floating point noise).

get_clssfcn FNAME abs_error Reads an classification generated by AutoClass-C. Do not forget to give an absolut error value from your original data (it will be used for the integration in computeMembership() ). Returns a Clssfcn object if successful, undef otherwise.
coord_2_bin COORD FNAME

Take a Coord structure and write it, in our proprietary .bin format, to FNAME. Returns 1 on success and undef on failure.

This function may be used when writing the results of a calculation, but more likely, it is used when converting PDB files to our .bin format. Look at pdb_read.

coord_c_n_dist COORD i, j, sqrt_flag

Given a Coord structure, COORD and the index of two residues, i and j return the distance, in Angstroms between the carbonyl oxygen of i and the backbone nitrogen atom of j if sqrt_flag is set to non-zero. If the flag is left as zero or undef, then the value returned will be the distance squared.

In a protein, the bondlength is about 1.32 Angstrom. Many scripts just want to check for continuous chains, so they do not need the real distance and it is faster to avoid the square root. For example, to check if the third and fourth residues are bonded, you might say

my $dist2 = 1.5 * 1.5;
if (coord_c_n_dist ($coord, 2, 3, 0) > $dist2) {
    print "not connected\n";
} else {
    print "connected\n";
}
coord_deletion COORD Start Coord_Length Seq_Length

Returns a pointer to a new Coord (structure) where an affine gap has been introduced, arising from the deletion of some sequence or coordinate data from an existing structure. The gap is specified by its Start [0,size), and number of residues excised from the structure (Coord_Length) and sequence (Seq_Length).

Minimal checking ensures gap specifications are sensible. This function is provided for the generation of synthetic threading results and other forms of improper structure sets.

coord_geo_gap COORD SCALE MAX

Calculate a penalty based on the geometric damage in a structure. This is based on the carbonyl carbon to amide nitrogen distance between each pair of residues which should be connected. The ideal C..N distance is 1.32 Angstrom. No penalty is enforced if the distance is less than 2.0 Angstrom. COORD is a set of coordinates. SCALE is the number with which penalties will be multiplied. MAX is a limit on the distance to be considered. For each gap, if it is bigger than MAX, it is set to MAX.

This function returns a list of values like:

($quadratic, $linear, $logistic, $num_gap)
             = coord_geo_gap ($coord, $scale, $max);

Where

$quadratic

Sum over (distance - ideal)^2 for each gap, where ideal is hard coded to about 1.32 Angstrom.

$linear

This is the sum over (distance - ideal) for each gap.

$logistic

This uses a logistic activation function. The penalty is the sum of a fancy logistic activation function applied to each gap.

$num_gap

This is the number of gaps, regardless of length.

coord_get_seq COORD

Returns a pointer to a Seq (sequence) given a pointer to a Coord.

coord_has_sec_s COORD

Returns non-zero if COORD has defined secondary structure.

coord_get_sec_s COORD

Returns a string containing the secondary structure assignment for the coordinates, if they exist. Side-effect: a non-serious warning message is generated if COORD has no assigned secondary structure.

coord_name COORD

Print the PDB acquisition code and chain identifier from the coordinates, COORD (an object of Coord type).

coord_read FNAME

Read coordinates from file FNAME. Returns a Coord pointer.

coord_rmsd PAIR_SET, COORD1, COORD2, SUBSET_FLAG
coord_rmsd PAIR_SET, COORD1, COORD2

Superimpose COORD1 onto COORD2 and calculate the root mean square difference of coordinates in Angstroms, based on an alignment. Return a list with shifted coordinates and the rmsd value like this

($rmsd, $new_c1, $new_c2) = rmsd ($pair_set, $c1, $c2)

or

($rmsd, $new_c1, $new_c2) = rmsd ($pair_set, $c1, $c2, SUBSET_FLAG)

$rmsd is the root mean square difference. $new_c1 is a coordinate object which has been moved. It may be smaller than $c1 (see note below about SUBSET_FLAG). $new_c2 may be identical to $c2.

$pair_set has come from some kind of alignment such as sequence to sequence or structure to structure. $c1 are the coordinates to be moved. $c2 are the reference coordinates. If defined, the optional argument, SUBSET_FLAG, means that one walks down the list coordinates of $c1 and $c2 and copies into $new_c1 and $new_c2 those sites which were aligned.

If the alignment of two proteins includes most of the original coordinates, then one will not usually want to define SUBSET_FLAG. If the two proteins are very different sizes, it makes sense to define this flag (and only see the similar regions). If SUBSET_FLAG is not set, $new_c2 is an exact copy of $c2.

coord_size COORD

Return an integer with the number of residues in COORD, a Coord object.

coord_2_pdb FNAME COORD SEQ
coord_2_pdb FNAME COORD

Write the coordinates from COORD (a Coord pointer) to the filename given by FNAME. Returns nonzero on success and the undefined value otherwise.

If the optional last argument, SEQ, is present, it should be a sequence object and will be used to print the original (hopefully) complete sequence in the pdb file in standard, PDB, SEQRES format records.

coord_2_spdb FNAME COORD SCOR_SET
coord_2_spdb FNAME COORD SCOR_SET SEQ

Identical to coord_2_pdb (above), this function additionally writes a temperature value in the generated PDB file, corresponding to the score given in SCOR_SET (a Scor_set pointer) for each residue in COORD.

See the entry for the function scor_set_simpl (below) for an example.

coord_2_pnlty COORD VALUE

COORD is a Coord type object. VALUE is a floating point number. Returns a FloatPtr (floating point array) which will be used for extra gap penalty weights during alignments. At each site in COORD, see if it is in secondary structure. If so, it gets an extra weight, given by VALUE. The exact method is more complicated.

If we have a structure which looks like

1 2 3 4 5 6 7
- E E E H - -

The we say that "E" and "H" refer to extended (beta strand) and helix, so residues 2 - 5 look as if they are in interesting secondary structure. In fact, the situation is more interesting. Residue 1 is adjoining a piece of secondary structure, and residue 2 is on the edge of one. Only residues 3 - 4 are really definitely inside secondary structure and residue 7 is definitely outside one. If VALUE is set to 10, then the coefficients will be

residue  coefficient
1        3.25
2        7.75
3       10
4       10
5        7.75
6        3.25
7        1
dmat_b_cliques MODEL REFERENCE BOUND SIZE

Returns a list of intervals on MODEL's coordinates corresponding to 'flat' areas of the difference of distance matrices between MODEL and REFERENCE (which could be the native fold of the sequence modelled by MODEL).

The list should be interpreted as ranges in the residues of MODEL :

start1, end1, start2, end2, ...

These define contiguous stretches of MODEL's structure whose distance contacts differ by less than BOUND angstroms from those in REFERENCE. Each interval is at least of size SIZE.

There are some default parameters :

my @good_regions_of_model =
    dmat_b_cliques $model, $native;

Returns the flat regions longer than 9 residues where DME is less than 1 angstrom.

dme_thresh FRACTION COORD1 COORD2 THRESHOLD

Compare the distance matrices from COORD1 and COORD2. Calculate the fraction of the distance matrix left after setting a threshold of THRESHOLD. Put this fraction into FRACTION.

Returns the useful answer in FRACTION, but returns undef on error.

DME stands for "distance matrix error", a name coined in Havel, TF, Biopolymers, 29, 1565-1585 (1990), but it is really the root mean square difference of distance matrices. To compare structures we calculate the DME based on alpha carbon coordinates. We then go back to the distance original distance matrices and find the most different matrix elements and enter a loop:

while (DME > threshold)
    remove most different element from calculation
return the fraction of the distance matrix remaining

If the two sets of coordinates were very similar, then we did not have to remove any matrix elements, so we return a value near 1.0. If the coordinates were very different, we had to throw away most matrix elements, so we return a number closer to 0.0.

A good value for THRESHOLD is about 3.5 or 4.0 Angstrom. This measure of similarity is bounded by 0 and 1.0 and is not too sensitive to protein size.

dme_nice MODEL REF BOUND SIZE CORE CORE_BORDER

Returns a list of intervals defining bounded regions of the difference of distance matrices between MODEL and REF. Whereas in dme_b_cliques, the intervals define bounded regions, the 'nice' regions defined by this function allow for some deviations at the boundary of the flat regions of a DME plot. SIZE refers to the structural fragment size, and CORE refers to the minimum size of a bounded region expressed as the number of overlapping fragments. CORE_BORDER is the size of the boundary region where deviation is allowed. Essentially this means that pairs of intervals that would be returned by dmat_b_cliques will be merged if they are separated by fewer than CORE_BORDER residues.

dssp COORD

Run the DSSP program on the coordinates in COORD. Store the answer in COORD.

make_model PAIR_SET SEQ COORD

Given an alignment in PAIR_SET (a Pair_set pointer) which comes from sequence SEQ aligned to coordinates COORD, return a Coord pointer with a model of the sequence on the coordinates.

model_pdb_num COORD MODELNUM

Given a COORD OBJECT and a residue position, returns the original PDB number stored for that residue.

model_res_num COORD RESIDUENUM

Given a COORD OBJECT and a residue number (according to PDB sequence numbering), returns the index for that residue in the coord object, or undefined if there is no coordinates for that sequence number.

num_seq SEQ_ARRAY

Returns the number of sequences in a sequence array.

pair_set_coverage PAIR_SET SIZE1 SIZE2

Given an alignment, we often want to know what part of the sequence or structure is accounted for. For example

A B C - - D E - F G H
    Q R S T U V W - Y Z

In the first string, you could imagine the residues we know about are like

0 0 1 1 1 1 0 1

while in the second string, the pattern is like

1 0 0 1 1 0 1 1

Call this function with a pair set and the size of the first and second objects (sequences or sequence/structure pair). pair_set_coverage returns two lists. One for each member of the alignment respectively.

This returns a pair of strings. In each string, a '0' char means the site is not covered by the alignment, but a '1' means it is. On failure, returns undef. A typical use would be

my ($s1, $s2) =
    pair_set_coverage ($pair_set, seq_size ($seq), coord_size ($coord);
print "Pattern of sequence coverage looks like\n$s1\n";
print "Pattern of struct coverage looks like \n$s2\n";

and the output strings might look like

00100111

Which would mean that the 3rd, 6th and 7th positions were properly aligned. Note also that this leaves room for some tricks. Strings can be binary &'d to look for overlap and '1's can be quickly counted using perl matching operators.

pair_set_extend PAIR_SET $EXT_LONG
pair_set_extend PAIR_SET $EXT_SHORT

If you have done a Smith and Waterman style alignment, you may not have all residues aligned. Consider

A  B  C  D  Q  R  S  E
      X  Y  Q  R  S  X  Y

The alignment from a highest scoring segment is

Q  R  S
Q  R  S

and this is what we will likely get if we don't ask otherwise. but we could reasonably do a short extension to include residues which go out to the ends of the shorter sequence.

C  D  Q  R  S  E
X  Y  Q  R  S  X

One may also want a longer extension, which would look like

A  B  C  D  Q  R  S  E  -
-  -  X  Y  Q  R  S  X  Y

Given these definitions of extension, (long and short), you may want either result.

The symbols, $EXT_LONG and $EXT_SHORT are constants. You should choose one of them.

pair_set_gap PAIR_SET SCALE_OPEN SCALE_WIDEN

Like coord_geo_gap, this is for calculating extra gap penalties after doing an alignment. This returns a gap penalty for the sequence, not the structure. It returns two values like

my ($open_penalty, $widen_penalty);
($open_penalty, $widen_penalty)
   = pair_set_gap($pair_set, $o_scl, $w_scl);

For each gap, the penalty is calculated as

penalty = open_scale + (n - 1) * widen_scale

where n is the length of the gap. Because these things are linear, you can get the number of gaps in a whole alignment from n_gap = $open_penalty / $open_scale.

pair_set_score PAIR_SET

This returns two floating point numbers. The first is the score from an alignment, including gaps. The second is the score, not including gaps. The intentions are to

  • Let one look at the pure sequence-structure score and

  • Let one use a more sophisticated gap scoring scheme than the one used in the alignment calculation.

For example, we have have something like

my $pair_set = score_mat_sum_sec (.....);
my ($score, $simple_score) = pair_set_score ($pair_set);
print "Score with gaps is ", $score,
      "without gaps is ", $simple_score, "\n";

For compatibility with old scripts, it is still OK to say

my $score = pair_set_score ($pair_set)
pair_set_string PAIR_SET SEQ1 SEQ2

Return a string for printing. PAIR_SET is an alignment corresponding to SEQ1 and SEQ2. If this came from a sequence to structure alignment, we still want the sequence corresponding to SEQ2. This is easy to get from

coord_get_seq (C)

but this routine should be renamed seq_from_coord().

pair_set_pretty_string PAIR_SET SEQ1 SEQ2 SEC_S_DATA COORD2
pair_set_pretty_string PAIR_SET SEQ1 SEQ2

Return a string with a pretty alignment.

If you have secondary structure data (SEC_S) which has probably come from a predictor and if you provide the coordinates (COORD2), then the output will include this secondary structure information.

param_fx_read FNAME

Read parameters for Thomas' Bayesian scoring / alignment functions. Returns an FXParam object on success, or undefined on failure.

param_rs_read FNAME

Read parameters for the tanh() based score function with all atoms. This is for rescoring, not alignment. Returns an RSParam object on success, or undefined on failure.

pdb_read FNAME ACQ_CODE CHAIN

Read the pdb file, FNAME, and return a Coord object. Note that this will not include secondary structure information. ACQ_CODE is the acquisition code. It can be blank, in which case, the program will try to guess the code from the filename. CHAIN can also be blank, or a single letter, chain specifier.

This function is for rewriting lots of files, for example, when we rebuild a library. This means it tolerates errors rather than stopping. Furthermore, it contains a list of residue name substitutions. For example, it will change

original new_name   comment
---------------------------
UNK      ALA        Unknown residues become alanine
MSE      MET        Selenomethionine becomes methionine
CSE      CYS        Selenocysteine becomes cystein

There are currently more than 20 of these conversions. They are a one way affair. After reading the coordinates, the original name is thrown away.

prob_vec_info PROB_VEC

PROB_VEC is a matrix of class probabilities. This returns a string containing some minimal information. Functions like "seq_2_prob_vec" return a probability vector.

scor_set_simpl PAIR_SET SCORE_MAT

Returns a Scor_set object, corresponding to the local score for each aligned pair of residues in PAIR_SET, based on the scores in SCORE_MAT. These local scores, when summed, yield the 'simple score' returned by pair_set_score.

The obvious use of this function is to generate an annotated PDB file :

coord_2_spdb ("1mypC.pdb", make_model($pair_set, $seq, $template)
                         , scor_set_simpl($pair_set,
                                          $tot_score_mat));

Where the temperature entry of the each residue in the model will be set to the local score given in the score matrix.

scor_set_fromvec @SCORES

Returns a Scor_set object containing the vector of scores given in @SCORES.

scor_set_to_vec $scor_set

Returns a vector of doubles corresponding to the scores contained in the Scor_set object. This is useful when you want to actually print out the scores for an alignment :

# float_to_character maps a signed float to a
# scale of alphanumerics

sub d2c( $ ) { return float_to_character ( $_ ) };

my $i=0;
my $cs = pair_set_coverage( $pairset, $seq_len, $coord_len );
my $scor = scor_set_to_vec( $scor_set); # from $pair_set
$cs=~ s/0/./g;
$cs=~ s/1/d2c($$scor[$i++]);/ge;

# $cs is now a coverage string where every aligned position is
# scored according to float_to_character.
scor_set_scale SCOR_SET SCALE

Returns true if it managed to divide each entry in the Scor_set by the scalar in SCALE.

score_fx MATRIX SEQ COORD PARAM

Fill out a score matrix for a sequence structure pair, based on Herr Doktor Huber's Bayesian-based score function. This score function works for sequence to structure alignments.

MATRIX is a new created score matrix, probably returned from score_mat_new. SEQ is a sequence object (SeqPtr) COORD is a coordinate object (CoordPtr). PARAM are parameters from param_fx_read. They have type FXParam and this type checking is enforced by the interpreter.

score_fx_prof MATRIX SP COORD PARAM

This is the same as score_fx, but the second argument, SP, is a sequence profile, rather than a sequence.

score_mat_add (MAT1, MAT2, SCALE, SHIFT)

Return a new score matrix where each element is given by

NEW_MAT = MAT1 + ( SCALE * MAT2 + SHIFT)

If only three arguments are given, SHIFT is set to zero.

score_mat_info MAT

Given a Score_mat object in MAT, return a list of the form (MIN, MAX, AV, STD_DEV) Ignore the first and last row and column, since they are used for treatment of end gaps.

Note, the calling procedure for this function changed in May 2006, so old scripts may break;

score_mat_new N_ROWS N_COLS

Create a new score matrix and return a Score_mat object. In a sequence to structure alignment, N_ROWS is the size of the sequence, N_COLS is the size of the structure. So, if $seq is a sequence and $coord is a structure, we might have

$scr_mat = score_mat_new (seq_size ($seq), coord_size ($coord));

You are entitled to assume that the new matrix is properly zeroed. Code should not bother to zero it again.

score_mat_scale MAT, SCALE

Multiply all elements in MAT by SCALE. This returns a new matrix.

score_mat_shift (MAT1, SHIFT)

Return a new score matrix where each element is given by

NEW_MAT = MAT1 + SHIFT

The first and last row and column are not shifted. This is one way to implement end gaps. This function returns a new matrix.

score_pvec MATRIX, PVEC1, PVEC2

Fill out the score matrix, MATRIX based on comparing the class membership probability vectors PVEC1 and PVEC2. These probability vectors check the length of the original fragments and will return undef if they do not match.

score_mat_read FNAME

Go to FNAME and read the score matrix that was probably written by score_mat_write. Return a new score matrix or undef on failure like

my $scr_mat = score_mat_read ($fname);
if (! $scr_mat) {
    print STDERR "Reading score matrix from $fname failed.\n"; }
score_mat_string SCORE_MAT SEQ1 SEQ2

This is mainly for debugging. After calling a score function, this will return a string containing the score matrix with the sequence at the top and left hand side. If you did a sequence to structure alignment, you should still call it with the sequences.

score_mat_sum_full RMAT, SCORE_MAT, PGAP_OPEN, PGAP_WIDEN, QGAP_OPEN, QGAP_WIDEN,P_MULT, Q_MULT, ALIGN_TYPE
score_mat_sum_full RMAT, SCORE_MAT, PGAP_OPEN, PGAP_WIDEN, QGAP_OPEN, QGAP_WIDEN,P_MULT, Q_MULT, ALIGN_TYPE, BIAS_SET

This does the summation of a score matrix. Returns non-zero on success and zero on failure. Note, there is an optional last argument. The parameters are

RMAT

This is used for returning the summed matrix. The function does not overwrite SCORE_MAT. Instead it returns a new matrix. This can generally be ignored.

SCORE_MAT

This is the score matrix which came from a call to something like score_smat.

PGAP_OPEN

Penalty for opening gaps in the first of the pair of sequences or structures or whatever.

PGAP_WIDEN

Penalty for extending the PGAPs.

QGAP_OPEN

Penalty for opening gaps in the second of the pair.

QGAP_WIDEN

Penalty for extending the QGAPs

P_MULT

This is an object of FloatPtr type with site specific coefficients for gap penalties and will be applied to the P_GAP. Typically, this will be for extra penalties for secondary structure gaps.

Q_MULT

As for P_MULT, but for the q_gaps.

ALIGN_TYPE

Either "$N_AND_W" or "$S_AND_W" for Needleman and Wunsch or Smith and Waterman respectively.

BIAS_SET

This is a pair_set which may have come from a previous alignment. It really is an object of Pair_setPtr type. It is optional. If present, the alignment will be coerced to follow this alignment. The use is that one can do an Smith and Waterman alignment and get the best scoring segment. One can then set the ALIGN_TYPE switch to $N_AND_W and do a globally optimal alignment, but passing through the optimal segment.

score_mat_write SCORE_MAT FNAME

Write the score matrix SCORE_MAT to the file name given by FNAME. Return undef on failure.

score_rs COORD PARAMS

This will apply the tanh() based score function which is primarily for rescoring. It is NOT neighbour-non-specific or unusual in any other way.

COORD is a Coord object. PARAMS is a set of parameters, formally of type RSParams and probably resulting from a call to param_rs_read.

Returns a floating point number.

score_smat SCORE_MAT, SEQ1, SEQ2, SUB_MAT

Fill out a score matrix based on sequence comparison of two sequences. SCORE_MAT is a score matrix which may have come from new_matrix. SEQ1 and SEQ2 are the sequences. SUB_MAT is a substitution matrix.

score_sprof SCORE_MAT, PROF, SEQ, SUB_MAT

This is very similar to score_smat. SCORE_MAT is a score matrix, probably from new_matrix. PROF is a sequence profile, maybe from blst_chk_read. SEQ is a sequence and SUB_MAT is a substition matrix. This fills out a score matrix using the substition matrix it has been given and fractional sequences from the sequence profile.

score_sec_s SCORE_MAT, SEC_S_DATA, COORD

Given a SCORE_MAT which probably came from score_mat_new, and some secondary structure data in SEC_S_DATA, probably from sec_s_data_read and a set of coordinates in COORD, do a scoring.

We compare each site against the predicted secondary structure. If the psi angle difference is more than 90 degrees, we set it to 90. We then return cos (difference). Note, this is different behaviour to earlier versions. If an angle is very different (diff > 90), it it not penalised. If the angle is less than 90, a value will be returned between 0 and 1.0

sec_s_data_read FNAME

Read secondary structure data from FNAME. This may be in PHD/Rost format or our "item_manual_secondary_structure_file" in "manual input format" format defined below. Return a pointer to an object of Sec_s_data type.

sec_s_data_string SEC_S

SEC_S is a pointer to a Sec_s_data object. Return a string with the bare secondary structure information. For example:

$x = sec_s_data_read ("sec_data_file")
       || die "Fail on $tmp_data: $!";
print sec_s_data_string($x);
seq_2_prob_vec SEQ, AA_CLSSFCN Given a sequence, SEQ and an amino acid classification, AA_CLSSFCN return an object of PROB_VEC type. This is most interesting for feeding to a score function like score_pvec.
seq_read FNAME

Read a sequence from file FNAME. Returns a SeqPtr object.

seq_read_many FNAME

Reads a number of sequences from file FNAME. Returns a SeqArrayPtr object (array of sequences).

seq_from_string STRING

Returns a sequence object SeqPtr from a string. This lets one build a sequence in the interpreter like

$seq = seq_from_string ('avlc');

Would store a sequence (Ala, Val, Leu, Cys) in the $seq variable. The string can have a FASTA style comment at the start. The string can span multiple lines.

seq_get_1 SEQ_ARRAY N

Return the N'th string from a sequence array, SEQ_ARRAY. Returns a SeqPtr object.

seq_num SEQ_ARRAY

Returns the number of sequences stored in the sequence array, SEQ_ARRAY.

seq_print SEQ

Return a sequence to the interpreter as a string. It is not pretty and will be fixed.

  • The sequence will be beautified so it comes out in blocks of ten residues, probably with numbering.

seq_print_many SEQ_ARRAY

Returns a string containing all the sequences in SEQ_ARRAY.

seq_size SEQ

Return the number of residues in a sequence, SEQ.

This is the number of residues in a sequence. There are no complications or frills like null terminators.

seqprof_get_seq SEQPROF

This is analogous to coord_get_seq. It takes a sequence profile and returns a Seq sequence.

seqprof_str SEQPROF

Take the sequence profile in the SEQPROF object and return a printable string. For example

if ( ! ($profile = blst_chk_read ( $fname))) {
    return undef; }
print seqprof_str ($profile);
struct_2_prob_vec COORD CLSSFCN Computes the memberships matrix (probability vector) for a structure COORD given a CLSSFCN
sub_mat_get_by_i SUB_MAT, M, N

Return the substition matrix element indexed by M and N

sub_mat_get_by_c SUB_MAT, A, B

Return the substitution matrix element for the amino acids A and B. So $a = sub_mat_get_by_c ($sub_mat, 'c', 'd'); would return the current value for cys/asp. The amino acid names are single letter codes and may be upper or lower case.

sub_mat_set_by_c SUB_MAT, A, B, val

Set the substitution matrix element for A and B to val.

sub_mat_set_by_i SUB_MAT, M, N, val

Set the value indexed by M,N in the substition matrix to val.

sub_mat_string SUB_MAT

Return a string containing the substitution / score matrix held in SUB_MAT.

sub_mat_read (FILENAME)

Go to FILENAME. Read up the substitution / score matrix and return it.

sub_mat_shift SUBST_MATRIX, BOTTOM

Given a substitution matrix (an object of type Sub_mat), shift the whole matrix so the smallest (most negative value) is of size BOTTOM. This does not return anything. It acts on the SUBST_MATRIX argument directly.

sub_mat_scale SUBST_MATRIX, BOTTOM, TOP

Given a substition matrix, SUBST_MATRIX, scale and shift it so the minimum and maximum values run from BOTTOM to TOP.

score_mat_sum_smpl NEW_MAT SCORE_MAT PGAP_OPEN PGAP_WIDEN QGAP_OPEN QGAP_WIDEN ALIGNMENT_TYPE

We have a score matrix which could be from sequence/sequence, sequence/structure or whatever. Now, do the dynamic programming work. Sum the score matrix and return a set of pairs.

The parameters are

NEW_MAT

This is a fresh matrix with the traced back scores in it.

SCORE_MAT

This is the score matrix.

GAP_OPEN
GAP_WIDEN
ALIGNMENT_TYPE

There are only two values allowed, either

$N_AND_W

or

$S_AND_W

These stand for "Needleman and Wunsch" and "Smith and Waterman" respectively. Any other value will cause an error.

svm_rs_cdata MODEL NATIVE SCOR_SET RS_PARAM CVTYPE

*EXPERIMENTAL!*

The function returns an array of training vectors suitable for use in training a support vector machine (libSVM.pm) or some other machine learning procedure. Its form is :

[ [label_class, [(feature vector)], .. ]

The scheme for calculating the training vectory is given in CVTYPE, and the data is formed from the local sequence to structure scores as given by SCOR_SET, the TANH forcefield based pairwise interaction terms (calculated via RS_PARAM), and the local model consistency (based on the difference of distance matrices computed between MODEL and NATIVE).

Scheme 0 works as follows : (see scoranlys.c:get_svmdata for details at the moment).

svm_rsfeat MODEL SCOR_SET RS_PARAMS CVTYPE

This returns a set of feature vectors for each position in MODEL, calculated from local sequence-structure fitness and residue-specific interaction terms according to the CVTYPE scheme (see svm_rs_cdata or scoranlys.c for details). The form is as follows :

my @m_fvset = svm_rsfeat MODEL, SCOR_SET, PARAMS, 0
@m_fvset is of form
  [ (undef), [feature vector], .., .., (undef)]
and (scalar @m_fvset) == coord_size(MODEL)

undefs are given for positions in the model where a full feature vector cannot be computed (at the ends, for instance).

BUILD AND INSTALL

Generally, nothing except the top level Makefile should be changed. In the top level directory, edit the Makefile, then type

make

If this looks OK, you might

cd scripts
perl hello.pl

This will check if Wurst pieces appear to be in place. If that looks OK, then edit wurst/src/Wurst/Makefile.PL to set the installation destination. Then

make install

from the top level directory. Then go back to the scripts directory and try a different file like

cd scripts
perl hello2.pl

This will attempt to run a test using something like "lib" as the destination directory.

FILE FORMATS

PHD secondary structure files

Wurst can read secondary structures from the PHD prediction server. The format is empirically defined. That means we try to read anything that comes from the server.

manual secondary structure file

One may type in secondary structure predictions or assignments. The format is like:

# speculative secondary structure assignments
secondary structure
5 - 30 h
37 e
40-45 e 5  # This is a very unreliable guess
  • This means that residues 5 to 30 are helix. Residue 37 is sheet (extended). Residues 40 to 45 are sheet (extended) and they have a confidence level of 5.

  • The first non blank line must say "secondary structure". This is not optional. It is used to recognise the file format.

  • Confidence levels can be given from 0 to 9. Zero means there is no confidence. 9 means you are very confident. This number must be an integer. There is no way to give a more detailed number.

  • Confidence levels are optional. The default is confidence=9.

  • Anything after a hash # is a comment.

  • Blank lines are ignored.

AUTHORS

Alphabetically... Steve Hoffmann, Thomas Huber, Tina Lai, Nasir Mahmood, Thomas Margraf, Martin Mosisch, James B. Procter, Gundolf Schenk, Andrew E. Torda.

SEE ALSO

coding.pod contains rules for adding to wurst.