NAME
Code - notes and coding rules
DESCRIPTION
The documentation for using wurst lives in wurst.pod. This is a description of internals, conventions and how to change the code.
Layout
- Philosophy
-
Have a look at some example files. There is a standard layout for making functions stand out. Please do not use your own layout, even if it is prettier. The aim is consistency and predictability.
- Tabs
-
There are no tab characters. Use spaces. Use four spaces to indent in functions and loops.
- Indentation
-
Indentation is almost pure Kernighan and Ritchie with four spaces.
- Line size
-
Try to make code readable in a window of 80 columns or printed page of 80 columns. This may mean shortening your line or breaking it in a place you think makes it more readable for others.
- Version control
-
All the files have lines like this:
#if !defined (lint) && !defined (DONT_SEE_RCS) static const char *rcsid = "$Id: coding.pod,v 1.1 2007/09/28 16:57:19 mmundry Exp $"; #endif /* !defined (lint) && !defined (DONT_SEE_RCS) */
These come after the #include directives.
- Function headings
-
Begin every function as in the examples. Do not use your own style, even if you think yours is better. Every function begins:
/* ---------------- foobar ----------------------------------- * This function takes an argument and makes coffee. */ static int foobar (int zot) { ....
To list the properties:
the comment style at the start should be respected, including the little stars and the empty column.
function type on its own line
the opening brace is on its own line
- Punctuation
-
There should usually be a space after punctuation, for exactly the same reasons as in English (helping readability). Commas and semi-colons definitely want to be followed by a space.
Do not write
for (a=0;a<lim;a++){
Do
for (a = 0; a < lim; a++) {
If the line is going to go beyone 79 spaces, delicately delete spaces while trying to keep readability.
- typedef
-
A typedef is useful for complicated structures. In many other places it obscures the meaning. In wurst, typedefs should never go into function prototypes. If they do, then other files cannot easily include the function prototype. Consider a very complicated structure of type
struct tree
. In a header you may writestruct tree; int function (struct tree *t);
Then anyone can include it without needing the definition of
struct tree
.If you write
int function (treestruct *t);
Then nobody can include it without seeing the
typedef
. - include files
-
Do not include files from other include files.
Split the description of the internal structures from the interface. Call the file with the internal structures,
blah.h
and the one with the interface,blah_i.h
.
- boolean
-
Do not add your own type with a name like "boolean" for true/false. If you are talking about errors, use
EXIT_SUCCESS
/EXIT_FAILURE
since these are ANSI standard. If you have a yes/no situation, then there is a header file with a simple enumerated type in yesno.h. The problem with your own boolean is that it gets redefined by different people, TRUE and FALSE often appear in unrelated header files and the most recent C standard has its own idea about boolean types.
INTERNALS
Declarations
- Ordering
-
The order in which you declare variables should make no difference. On the compaq, at least within structures, the compiler makes noise unless you order objects from largest to smallest. This avoids padding to get best alignment.
- Function declarations
-
If a function can possibly be made static, it should be. Symbols should be no more visible than is necessary. There are also optimisation advantages (the compiler can assume a function is local and generate only an inlined version).
If you have a static function you must
Not declare it, then define it and use it. Instead,
You should define it once, before it is used.
In other words, do
static int function (int arg) { ... } ... b = function (a);
Rather than
static int function (int arg); ... static int function (int arg) { ... } ... b = function (a);
Obviously this rule has to be broken if there is mutual recursion.
This rule exists for consistency (it is used in every file) and to follow the principle that we should write things the minimum number of times and lastly, to ensure that whenever possible, a compiler is given all the information about a static function by the time it is used.
Return values and errors
There are two aspects to this.
- Error messages
-
Generally, the C functions do not print out too much information in case of error. They must, however, return an error code to the interpreter. The C functions should use
err_printf()
to pass on the result ofstrerror()
/perror()
. Typically, this means the lower level C code will write something likeno such file or directory
, but it is up to the interpreter/script to decide whether or not to go on. - Return codes
-
Most of the C code returns
EXIT_SUCCESS
if it is happy andEXIT_FAILURE
(as defined in stdlib.h). This leads to a problem.EXIT_FAILURE
may be defined as1
, but the perl interpreter usually takes1
as indicating success. The convention in this code is thatThe C code returns
EXIT_SUCCESS
/EXIT_FAILURE
as it pleases. The interface code (Wurst.xs) handles swapping the sense when necessary. So, if C code returns an error, the interface code may doXSRETURN_UNDEF
.
Conventions and internal routines
Wurst code may not call any of the following directly.
There are some general areas to think about.
- printing
-
Functions should not print to stdout or stderr directly. Everything should go via
err_printf()
ormprintf()
ormfprintf()
.If you have to print, you code must include
#include "mprintf.h"
- mprintf()
-
Takes exactly the same arguments as
printf()
. - mfprint()
-
Exactly the same as
fprintf()
. - err_printf()
-
This is a replacement for
fprintf(stderr, "blah")
, but it is not a direct wrapper. The convention is#include <stdio.h> #include "mprintf.h" ... int foo_bar (...) { const char *this_sub = "foo_bar"; ... if (error) err_printf (this_sub, "blah %s message\n", args); }
This will result in a printing to something like stderr a string like
Function foo_bar: blah args messsage
- mputchar()
- mfputc()
- mputs()
- mfputs()
- mperror()
-
All these functions with names like mxxx(), behave exactly as xxx().
- Passing strings to interpreter
-
If you have a static string to pass back to the interpreter, declare the function as
char * some_func (..)
and all will be well. This. however, is the exception. Normally, there is a string to be printed out and it has been dynamically allocated. After the interpreter has finished with the string, it should be free()'d. There are two ways to go about this.
perls SV mechanism
our internal scr_xxx() mechanism
Both the approaches are fine, but perhaps the first (perl's mechanism) should be preferred from now on.
Imagine your function allocates space for a string, writes something into it and you want the interpreter to be able to print it out.
In the C code, declare your function returning char * like
char * my_func (...) { char *x = E_MALLOC (sizeof (x[0] * length); strcpy (x, "something"); return x; }
Then, in the file, something.xs, use an interface like
SV * my_func (a) int a; CODE: { char *s = my_func(...); RETVAL = newSVpv (s, 0); free (s); } OUTPUT: RETVAL
This will copy the string into a
SV
which the interpreter knows how to deal with. We can immediately free() the pointer from our function.The second method is to use the scratch space routines,
scr_printf()
like this.#include "scratch.h" char *s; scr_reset(); s = scr_printf ("blah %s\n", foo); s = scr_printf ("more text\n"); return s;
The first call
scr_reset()
is necessary to tell our scratch space to free any old strings. Subsequent calls append the strings. - Memory allocation and malloc
-
Do not call
malloc()
. Instead, call the macroE_MALLOC()
. This takes exactly the same arguments asmalloc()
, but expands to print out the file and line number ifmalloc()
fails.It is a conscious design decision that we do not do any error recovery there. If we run out of memory, we are cactus.
Memory allocation philosophy and responsibility
Most of the time, memory is malloc()'d and free()'d at the same level of code. There is a major exception to this. Objects which are passed back to the interpreter rely on the interpreter's garbage collection. You must declare and appropriate routine like foo_blah_DESTROY(). This will be called by the interpreter when the reference count for the object goes to zero. This routine must clean up any allocated space.
This approach works well for objects like score matrices. These are allocated once. Although their contents may be manipulated, they do not move. Other objects, such as sequence arrays may grow or shrink and move about as determined by calls to E_REALLOC(). This means we have an extra level of redirection and the interpreter is actually given a pointer to a pointer to a sequence array.
Score matrix storage
The score matrix for two objects of size M and N is of size, (M + 2) * (N + 2). The reason is that it is easier to do special treatment of ends and end gaps this way. This means one has to be careful when coding around the score matrices.
Sequences and sequence arrays
In an early version, functions operated on arrays of sequences. This is being phased out. Some functions may still be able to handle arrays of sequences, mainly for getting sequences in FASTA format.
Structures and special rules
- struct seq
-
- size
-
These items hold a sequence. The size element holds the number of residues, but the string is allocated for (n + 1) so we can really treat it as a string with a null terminator.
- format
-
Sequence strings can either be conventional, one-letter amino acid names or in 'thomas' format. The strings have an enumerated type to say what state they are in. If your code depends on some format, then call the appropriate function to convert the string from one style to the other. For example, score functions will often force strings to 'THOMAS' format. Before printing, strings will usually be converted to 'PRINTABLE' format.
- struct coord
-
The coordinate structure not only holds coordinates, it holds a corresponding sequence in a
struct seq
containe within. The number of elements in the coord and seq structures must be the same.Within the coord structure, code may not assume that the secondary strucure (sec_s) is complete, nor the phi angles (phi). If these are NULL pointers, it means they have not been read up, calculated or filled out. The
coord_DESTROY
routine has a look to see if these are NULL pointers and callsfree()
only if they are non-NULL. - struct prob_vec
-
Probability vectors are complicated because
They can be in the simple array or a compact format
They may be normalised in two different ways.
The number of sites and probability vectors may vary.
In more detail
- prob_vec data storage
-
- expanded
-
In expanded form, the data is in a dynamically allocated 2D array. In mship[a][b], the slower changing index,
a
is the site in the protein. The faster,b
, is the class index. - compressed
-
In the
mship[a][b]
example, most of the elements are near zero. In the compressed form,cmpct_n->cmpct_n
is an array where each element tells us how many non-zero probabilities are associated with sitea
. For sitei
,cmpct_n[i]
says how many probabilities are stored inprob_vec->cmpct_prob
. For each of those probabilities, there is a corresponding entry inprob_vec->cmpct_ndx
which is the number (index) of the associated class. The number of elements incmpct_prob
will be the same as incmpct_ndx
. - protocol
-
If there is no compressed data,
prob_vec->cmpct_n
must point toNULL
. Ifprob_vec->cmpct_n
is non-null, the pointer will be assumed valid. If there is no expanded data,prob_vec->mship
must point toNULL
.
- prob_vec normalisation
-
In terms of probabilities, the vector should be normalised so that all the entries sum to 1.0. When comparing objects, we want to treat the probability vectors as vectors of unit length, so we can take the dot product in order to see the similarity. We also have to allow for not knowing the state of our vectors. These three situations are coded for by setting
prob_vec->norm_type
toPVEC_TRUE_PROB
PVEC_UNIT_VEC
PVEC_CRAP
- prob_vec number of sites / vectors
-
Describe the number of sites.
ADDING WURST
To add a function to wurst, you have to add something to
- Wurst/Wurst.xs
-
This is the C/perl interface
- Wurst/Wurst.pm
-
This advertises the symbols which the scripts can use.
- The C file
-
The actual code to be called.
- pod/wurst.pod
-
This is the documentation.
In more detail. If you want to add a function, do_stuff which acts on something of type thing_struct, then do at least the following.
C code
In do_stuff.c, or wherever, define the function like
int
do_stuff (struct thing_struct *thing) {
....
}
In do_stuff.h, prototype the function interface like
int do_stuff (struct thing_struct *thing);
Don't define the structure here. It is not necessary for the perl interface to see structure internals.
XS code
In Wurst/Wurst.xs, add
#include "do_stuff.h"
and a typedef
typedef struct thing_struct Thing_struct;
The capitalisation is not a joke. Perl likes it. Finally, still in Wurst/Wurst.xs, add the function interface,
int
do_stuff (x)
Thing_struct *x;
.pm code
Go to Wurst.pm. Find the @EXPORT section and add do_stuff.
.pod documentation
Go to pod/wurst.pod and add a description of the new function under the heading, FUNCTIONS
If the function returns some data type back to the interpreter, add a mention of that data type as well.
DEBUGGING
General debugging
There are two kinds of debugging:
- perl debugging
-
This is not so interesting. Use the perl debugger.
- debugging the C code
-
This is more fun and discussed below.
Wurst is coded as a perl extension, but perl does not have debugging symbols. During development, we will typically compile the extension with -g
. The problem is that you cannot set breakpoints since at program load time, the perl extension (a dynamically loaded library) is not there and the debugger does not see a symbol table.
This is not a terrible problem. The trick is to send a little signal after the program has started, but before the breakpoint. There are two ways to do this.
- signal from perl
-
Somewhere early in the perl script, before entering the function where you want a breakpoint, insert
$SIG{TRAP} = 'IGNORE'; kill 'TRAP', $$;
You could also set
$SIG{TRAP}
to point to some function of your own.These lines say that we will ignore TRAP signals. Then the code sends a TRAP to itself. The debugger notices the signal and gives you a command prompt. Now, the dynamic library is loaded and breakpoints can be set. This works with either gdb or dbx.
The nice aspect of this approach is that when one is not debugging, the script runs fine. The signal is sent, but ignored.
- signal from C
-
Sometimes one knows pretty well which function is going to be a problem, and where one wants to look closely. We can use the same mechanism by calling
kill()
from the C code. The easiest way to do this is to find the file of interest. At the top of the file,#include "dbg.h"
In the subroutine of interest, insert a line
breaker();
This little function just gets out the current process ID and sends it a trap'able signal. The debugger will give you a command prompt inside the breaker() function which can be quickly stepped out of.
Memory debugging
This is a challenge. One wants to work with a debugging malloc()
library, which looks for leaks and so on. When building perl, it offers you the use of its own malloc()
library or the system one. It is probably useful to build perl with the system library. The memory checking under dbx
on Solaris works fine. Getting an add-on library like electric fence to work under solaris is a bore. Wild use of the LD_PRELOAD environment variable results in a program which gobbles up all available memory. Under linux, it seems to be no problem to link electric fence in.
We assume that perl itself is free of leaks and we only want to instrument our library.
Rules
Please follow these rules when writing wurst code.
Function interfaces
Keep the definition of the interface minimal and separate from the definition of the innards. If you have a C file call foo.c, then
Put the public interface in foo_i.h
Put the innnards in a file like foo.h
For example
You may have some code like
struct cow {
int legs;
float weight;
};
int
print_cow ( struct cow * daisy)
{
....
}
Now there are a few files which have to know the innards of a struct cow
and others which will want to call print_cow(). In this case, create a file cow.h which defines the structure:
struct cow {
int legs;
float weight;
};
and in a separate file, cow_i.h,
struct cow;
int print_cow (struct cow *c);
Note:
Callers of print_cow() do not need to see the innards of the cow struct.
You should not use a typedef in the function defintion or prototype. If the code said
struct cow {...}. typedef struct cow Cow; typedef struct cow *CowPtr;
then the prototype for print_cow() would be
int print_cow (CowPtr);
but then every caller would need a definition of CowPtr. Please do not do that.
Text and messages
- mfprintf.c
-
Use the wrappers in mprintf.c. These may not do anything fancy now, but they allow us to wrap our output at a later stage. For example, a Tcl extension is not allowed to read or write to stdout / stderr. In that case, the wrappers can do whatever is necessary to get printf(), fprintf() functions.
- File Opening
-
It is so frequent that we call fopen() with an error message, that this is now in the wrapper, mfopen() in fio.c. Use
#include "fio.h" FILE * mfopen (const char *fname, const char *mode, const char *s)
where fname and mode are as for fopen() and s is a string which will go into error messages. Typically, this will be the name of the caller.
- File Reading and Caching
-
We have a hook to ask the OS not not to cache a file after reading it for opening. Use the function
file_no_cache(fp)
on the FILE pointer,fp
. This somtimes makes a surprising difference in performance. When reading a library of structures or profiles, wurst may read almost 10000 files sequentially. The OS tries to cache each one, although they will not be read again. The damage is that it pushes useful pages such as other people's programs out of the cache. The functionfile_no_cache()
is a wrapper around a posix function which may not be present, so the wrapper checks appropriate#define
d values and is a no-op if the posix functions do not seem to be present. A typical usage would be, if ((fp = mfopen (fname, "r", this_sub)) == NULL) return NULL;{ int tmp; const char *s = "Turning off caching for %s:\n\"%s\"\n"; if ((tmp = file_no_cache(fp)) != 0) err_printf (this_sub, s, fname, strerror (tmp)); }
RETURN value is zero on success or
errno
on failure. - Reading lines from files
-
Very often, we want to read a line from a file, hopping over blank lines and lines that begin with a hash (#). Do not write another function to do this.
#include "misc.h" get_nline (FILE *fp, char *buf, int *nr_line, size_t maxbuf)
Reads a line from fp. into buffer buf whose size is given by maxbuf. nr_line points to a line counter. get_nline() will increment the variable on every line read.
The routine throws away anything on a line after the hash (#) character, so you can have inline comments.
amino acids
There are a couple of common conversions for amino acid names and types. We may go from one to three letter codes and, internally, from Thomas to standard one letter codes. All functions for this kind of thing live in amino_a.c.
memory allocation
- standard malloc()
-
All memory allocation must be done through the macros E_MALLOC() and E_REALLOC() which live in e_malloc.c. They take the standard arguments and return the standard results, but, on failure, print out a line saying where the error occurred, how much memory was requested and then they die. This is generally good behaviour. Fancier code could attempt error recovery, but usually when malloc() fails, it means there was a code error.
- matrices
-
All two dimensional matrices should be allocated via the routines which have their roots in Thomas' code (which is basically the outline given in Numerical Recipes).
Function naming
Some days, I call my functions print_thing()
and print_foo()
. On others, I use foo_print()
and thing_print()
. Neither is better than the other. Here is the rule...
Functions should be named thing_print()
.
Note, I am aware of the disadvantage that you cannot see all the things you can print or copy or whatever.