representative_sequences

Example:

representative_sequences [arguments] < input > output

This is a pipe command. The input is taken from the standard input, and the output is to the standard output.

Documentation for underlying call

This script is a wrapper for the CDMI-API call representative_sequences. It is documented as follows:

The call returns two values. The first is the list of representative triples, and the second is the list of sets, in which the first entry is always the representative sequence.

-e Existing Representatives

These are sequences that are currently representatives; the command extends this set.

-o order-option

Order sequences using the designated option (note that -b is another way to get a long-to-short ordering). Supported options are:

long-to-short
default (as is)

-b

order input sequences by size (long to short)
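
As an illustration of what a long-to-short ordering means, the following is a minimal Python sketch and not the command's own code. The name order_long_to_short is hypothetical, and the tie-breaking rule (equal lengths keep their input order) is an assumption of the sketch rather than a documented property of the command.

```python
from typing import List, Tuple

# (id, comment, sequence), mirroring the seq_triple type described in this document
SeqTriple = Tuple[str, str, str]


def order_long_to_short(triples: List[SeqTriple]) -> List[SeqTriple]:
    """Return the triples sorted by sequence length, longest first.

    sorted() is stable, so sequences of equal length keep their input
    order. That tie-breaking rule is an assumption of this sketch.
    """
    return sorted(triples, key=lambda t: len(t[2]), reverse=True)


example = [
    ("id1", "", "MKT"),
    ("id2", "", "MKTAYIAK"),
    ("id3", "", "MKTAY"),
]
ordered_ids = [t[0] for t in order_long_to_short(example)]
```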

-c cluster_type

behavior of clustering algorithm (0 or 1, D=1)

cluster_type 0 is the original method, which has only the representative for each group in the blast database. This can randomly segregate distant members of groups, regardless of the placement of other very similar sequences.

cluster_type 1 adds more diverse representatives of a group in the blast database. This is slightly more expensive, but is much less likely to split close relatives into different groups.

-d seq_clust_dir - directory for files of clustered sequences

With the -d option, each cluster of sequences is written to a distinct file in the specified directory.

-f id_clust_file - file with one line per cluster, listing its ids

With the -f option, for each cluster, a tab-separated list of ids is written to the specified file.
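
A file in this format could be read back with a few lines of Python. This is a sketch under stated assumptions: the function name read_id_clusters is hypothetical, and the sample ids are made up for illustration. Treating the first id on each line as the representative is also an assumption, carried over from the description of the returned sets above.

```python
import io
from typing import Iterable, List


def read_id_clusters(lines: Iterable[str]) -> List[List[str]]:
    """Parse one cluster per line, with ids separated by tabs."""
    clusters = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        clusters.append(line.split("\t"))
    return clusters


# Made-up sample content; a real file would come from the -f option.
sample = io.StringIO("id1\tid4\tid7\nid2\nid3\tid5\n")
clusters = read_id_clusters(sample)

# Assumption: the first id in each cluster is its representative.
representatives = [cluster[0] for cluster in clusters]
```

To read an actual file, an open file handle can be passed in place of the StringIO object.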

-m measure_of_sim - measure of similarity to use:

Sequences are removed if their similarity to a "kept" sequence exceeds a specified threshold (see -s below).

The possible measures of similarity that you can specify are as follows:

identity_fraction (default), positive_fraction (proteins only), or score_per_position (0-2 bits)

-s similarity - similarity required to be clustered (D = 0.8)

The similarity threshold used to determine when sequences are deleted (but represented by a kept sequence).
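
The role of the threshold can be pictured with a toy greedy loop in Python. This is only a sketch of the general idea (keep a sequence unless it is similar enough to one already kept); it is not the algorithm the command uses, which works with a blast database. The names toy_identity_fraction and greedy_representatives are hypothetical, the similarity function compares positions without any alignment, and whether the cutoff is inclusive is an assumption.

```python
from typing import Callable, List, Tuple

# (id, comment, sequence), mirroring the seq_triple type described in this document
SeqTriple = Tuple[str, str, str]


def toy_identity_fraction(a: str, b: str) -> float:
    """Fraction of identical positions, compared without alignment.

    A toy stand-in for a real similarity measure, for illustration only.
    """
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def greedy_representatives(
    triples: List[SeqTriple],
    cutoff: float = 0.8,
    similarity: Callable[[str, str], float] = toy_identity_fraction,
) -> Tuple[List[SeqTriple], List[List[str]]]:
    """Return (kept representatives, clusters of ids).

    Each cluster lists the representative's id first.
    """
    reps: List[SeqTriple] = []
    clusters: List[List[str]] = []
    for triple in triples:
        seq_id, _comment, seq = triple
        for i, rep in enumerate(reps):
            # Assumption: the boundary case (similarity == cutoff) clusters.
            if similarity(rep[2], seq) >= cutoff:
                clusters[i].append(seq_id)
                break
        else:
            # Not similar enough to any kept sequence: keep it as a new representative.
            reps.append(triple)
            clusters.append([seq_id])
    return reps, clusters


reps, clusters = greedy_representatives([
    ("a", "", "MKTAYIAKQR"),
    ("b", "", "MKTAYIAKQK"),  # 9 of 10 positions match "a"
    ("c", "", "GGGGGGGGGG"),  # no positions match "a"
])
```

In this sketch the input order decides which sequence of a similar pair becomes the representative, since earlier sequences are kept first.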

Parameter and return types
$seq_set is a seq_set
$rep_seq_parms is a rep_seq_parms
$return_1 is an id_set
$return_2 is a reference to a list where each element is an id_set
seq_set is a reference to a list where each element is a seq_triple
seq_triple is a reference to a list containing 3 items:
	0: an id
	1: a comment
	2: a sequence
id is a string
comment is a string
sequence is a string
rep_seq_parms is a reference to a hash where the following keys are defined:
	existing_reps has a value which is a seq_set
	order has a value which is an int
	alg has a value which is an int
	type_sim has a value which is an int
	cutoff has a value which is a float
id_set is a reference to a list where each element is an id
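
For readers more used to Python than to Perl references, the types above could be written roughly as follows. This is a sketch of the shapes only: the alias and class names are hypothetical, the sample values are made up, and the meaning of the integer codes for order, alg, and type_sim is not given in this document, so none are shown.

```python
from typing import List, Tuple, TypedDict

# seq_triple: a list of 3 items (id, comment, sequence), all strings
SeqTriple = Tuple[str, str, str]
# seq_set: a list of seq_triple
SeqSet = List[SeqTriple]
# id_set: a list of ids
IdSet = List[str]


class RepSeqParms(TypedDict, total=False):
    """rep_seq_parms: a hash with the keys listed above."""
    existing_reps: SeqSet
    order: int      # integer code; values not documented here
    alg: int        # integer code; values not documented here
    type_sim: int   # integer code; values not documented here
    cutoff: float


# Made-up values, only to show the shapes.
seq_set: SeqSet = [("id1", "example comment", "MKTAYIAKQR")]
parms: RepSeqParms = {"existing_reps": [], "cutoff": 0.8}

# Shapes of the two return values as given in the type listing.
return_1: IdSet = ["id1"]
return_2: List[IdSet] = [["id1"]]
```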

Input Format

The input is a fasta-formatted set of sequences.  These sequences
should not contain indels.
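
A minimal Python sketch of reading such input into (id, comment, sequence) triples, with a check for gap characters, is shown below. The function names read_fasta and ids_with_gaps are hypothetical. Splitting the header at the first whitespace into id and comment, and interpreting "indels" as "-" characters in the sequence, are both assumptions of the sketch.

```python
from typing import Iterable, List, Optional, Tuple

# (id, comment, sequence), mirroring the seq_triple type described in this document
SeqTriple = Tuple[str, str, str]


def read_fasta(lines: Iterable[str]) -> List[SeqTriple]:
    """Parse FASTA text into (id, comment, sequence) triples."""
    triples: List[SeqTriple] = []
    seq_id: Optional[str] = None
    comment = ""
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if seq_id is not None:
                triples.append((seq_id, comment, "".join(chunks)))
            # Assumption: id is the first word, the rest is the comment.
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        triples.append((seq_id, comment, "".join(chunks)))
    return triples


def ids_with_gaps(triples: List[SeqTriple], gap_chars: str = "-") -> List[str]:
    """Return ids whose sequence contains a gap character (assumed to be '-')."""
    return [t[0] for t in triples if any(c in gap_chars for c in t[2])]


sample = ">id1 first protein\nMKTAY\nIAKQR\n>id2\nMK-AY\n".splitlines()
triples = read_fasta(sample)
flagged = ids_with_gaps(triples)
```

Sequences flagged this way would need their gap characters removed before being passed to the command.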

Output Format

FASTA output of the representatives is always written to STDOUT.

The -d option will cause a directory to be built containing the clusters.

The -f option will cause an abbreviated format of the clusters (just IDs) to
be written to the specified file.