NAME

PDL::Cluster - PDL interface to the C Clustering Library

SYNOPSIS

use PDL::Cluster;

##-----------------------------------------------------
## Data Format
$d =   42;                     ##-- number of features
$n = 1024;                     ##-- number of data elements

$data = random($d,$n);         ##-- data matrix
$elt  = $data->slice(",($i)"); ##-- element data vector
$ftr  = $data->slice("($j),"); ##-- feature vector over all elements

$wts  = ones($d)/$d;           ##-- feature weights
$msk  = ones($d,$n);           ##-- missing-datum mask (1=ok)

##-----------------------------------------------------
## Library Utilities

$mean = $ftr->cmean();
$median = $ftr->cmedian();

calculate_weights($data,$msk,$wts, $cutoff,$expnt,
                  $weights);

##-----------------------------------------------------
## Distance Functions

clusterdistance($data,$msk,$wts, $n1,$n2,$idx1,$idx2,
                $dist,
                $distFlag, $methodFlag2);

distancematrix($data,$msk,$wts, $distmat, $distFlag);

##-----------------------------------------------------
## Partitioning Algorithms

getclustermean($data,$msk,$clusterids,
               $ctrdata, $ctrmask);

getclustermedian($data,$msk,$clusterids,
                 $ctrdata, $ctrmask);

getclustermedoid($distmat,$clusterids,$centroids,
                 $errorsums);

kcluster($k, $data,$msk,$wts, $npass,
         $clusterids, $error, $nfound,
         $distFlag, $methodFlag);

kmedoids($k, $distmat,$npass,
         $clusterids, $error, $nfound);

##-----------------------------------------------------
## Hierarchical Algorithms

treecluster($data,$msk,$wts,
            $tree, $lnkdist,
            $distFlag, $methodFlag);

treeclusterd($data,$msk,$wts, $distmat,
             $tree, $lnkdist,
             $distFlag, $methodFlag);

cuttree($tree, $nclusters,
        $clusterids);

##-----------------------------------------------------
## Self-Organizing Maps

somcluster($data,$msk,$wts, $nx,$ny,$tau,$niter,
           $clusterids,
           $distFlag);

##-----------------------------------------------------
## Principal Component Analysis

pca($U, $S, $V);

##-----------------------------------------------------
## Extensions

rowdistances($data,$msk,$wts, $rowids1,$rowids2, $distvec, $distFlag);
clusterdistances($data,$msk,$wts, $rowids, $index2,
                 $dist,
                 $distFlag, $methodFlag);

clustersizes($clusterids, $clustersizes);
clusterelements($clusterids, $clustersizes, $eltids);
clusterelementmask($clusterids, $eltmask);

clusterdistancematrix($data,$msk,$wts,
                      $rowids, $clustersizes, $eltids,
                      $dist,
                      $distFlag, $methodFlag);

clusterenc($clusterids, $clens,$cvals,$crowids, $k);
clusterdec($clens,$cvals,$crowids, $clusterids, $k);
clusteroffsets($clusterids, $coffsets,$cvals,$crowids, $k);
clusterdistancematrixenc($data,$msk,$wts,
                         $clens1,$crowids1, $clens2,$crowids2,
                         $dist,
                         $distFlag, $methodFlag);

a: Distance between the arithmetic means of the two clusters, as for treecluster() "f".
m: Distance between the medians of the two clusters, as for treecluster() "c".
s: Minimum pairwise distance between members of the two clusters, as for treecluster() "s".
x: Maximum pairwise distance between members of the two clusters, as for treecluster() "m".
v: Average of the pairwise distances between members of the two clusters, as for treecluster() "a".

getclustermean

Signature: (
 double data(d,n);
 int    mask(d,n);
 int    clusterids(n);
 double [o]cdata(d,k);
 int    [o]cmask(d,k);
 )

Really just a wrapper for getclustercentroids(...,"a").

getclustermedian

Signature: (
 double data(d,n);
 int    mask(d,n);
 int    clusterids(n);
 double [o]cdata(d,k);
 int    [o]cmask(d,k);
 )

Really just a wrapper for getclustercentroids(...,"m").

clusterenc

Signature: (
 int    clusterids(n);
 int [o]clusterlens(k1);
 int [o]clustervals(k1);
 int [o]clusterrows(n);
 ;
 int k1;
 )

Encodes datum-to-cluster vector $clusterids() for efficiently mapping clusters-to-data. Returned PDL $clusterlens() holds the lengths of each cluster containing at least one element. $clustervals() holds the IDs of such clusters as they appear as values in $clusterids(). $clusterrows() is such that:

all( rld($clusterlens, $clustervals) == $clusterids )

... if all available cluster-ids are in use.

If specified, $k1 is a Perl scalar holding the number of clusters (maximum cluster index + 1); otherwise an appropriate value will be guessed from $clusterids().

Really just a wrapper for some lower-level PDL and PDL::Cluster calls.
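The encoding described above can be sketched in plain Perl, without PDL, to make the three outputs concrete. This is only an illustration of one reading of the description: the helper name encode_clusters() is hypothetical, and the assumption that $clusterrows() lists row indices grouped by ascending cluster ID is not confirmed by the text.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDL::Cluster): encode a datum-to-cluster
# list as (lengths, values, rows), grouping row indices by cluster ID.
sub encode_clusters {
    my @ids = @_;
    # Order row indices by cluster ID, breaking ties by row index.
    my @rows = sort { $ids[$a] <=> $ids[$b] or $a <=> $b } 0 .. $#ids;
    my (@lens, @vals);
    for my $r (@rows) {
        if (@vals && $vals[-1] == $ids[$r]) {
            $lens[-1]++;               # same cluster as the previous row
        }
        else {
            push @vals, $ids[$r];      # first row of a new cluster
            push @lens, 1;
        }
    }
    return (\@lens, \@vals, \@rows);
}

my @clusterids = (1, 0, 1, 2, 0);
my ($lens, $vals, $rows) = encode_clusters(@clusterids);
print "lens=@$lens vals=@$vals rows=@$rows\n";
# prints: lens=2 2 1 vals=0 1 2 rows=1 4 0 2 3
```

Under this reading, the members of each cluster occupy one contiguous run of the rows list, which is what makes cluster-to-data lookups cheap.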

clusterdec

Signature: (
 int    clusterlens(k1);
 int    clustervals(k1);
 int    clusterrows(n);
 int [o]clusterids(n);
 )

Decodes cluster-to-datum vectors ($clusterlens,$clustervals,$clusterrows) into a single datum-to-cluster vector $clusterids(). ($clusterlens,$clustervals,$clusterrows) are as returned by the clusterenc() method.

Un-addressed row-index values in $clusterrows() will be assigned the pseudo-cluster (-1) in $clusterids().

Really just a wrapper for some lower-level PDL calls.
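A plain-Perl sketch of the decoding step may help clarify the pseudo-cluster (-1). The helper decode_clusters() is hypothetical, and it assumes that only the first sum($clusterlens) entries of the rows list are addressed; rows outside that prefix keep the value -1.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDL::Cluster): rebuild a datum-to-cluster
# list from (lengths, values, rows). Rows that are never addressed keep -1.
sub decode_clusters {
    my ($lens, $vals, $rows, $n) = @_;
    my @ids = (-1) x $n;
    my $pos = 0;
    for my $c (0 .. $#$lens) {
        for (1 .. $lens->[$c]) {
            $ids[ $rows->[$pos++] ] = $vals->[$c];
        }
    }
    return @ids;
}

# Two clusters (IDs 0 and 2) covering rows 1, 4 and 3; rows 0 and 2 are
# listed in the rows vector but not covered by any cluster length.
my @ids = decode_clusters([2, 1], [0, 2], [1, 4, 3, 0, 2], 5);
print "@ids\n";
# prints: -1 0 -1 2 0
```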

clusteroffsets

Signature: (
 int    clusterids(n);
 int [o]clusteroffsets(k1+1);
 int [o]clustervals(k1);
 int [o]clusterrows(n);
 ;
 int k1;
 )

Encodes datum-to-cluster vector $clusterids() for efficiently mapping clusters-to-data. Like clusterenc(), but returns cumulative offsets instead of lengths.

Really just a wrapper for clusterenc(), cumusumover(), and append().
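The relationship between lengths and offsets can be sketched in plain Perl. The sketch assumes that the k1+1 offsets start at 0 and end at n, so that cluster c occupies positions offsets[c] .. offsets[c+1]-1 of the rows list; the text does not state this layout explicitly.

```perl
use strict;
use warnings;

# Lengths and rows as a clusterenc()-style encoding might return them
# (illustrative values, assumed rather than produced by the library).
my @lens = (2, 2, 1);
my @rows = (1, 4, 0, 2, 3);

# Cumulative offsets: a leading 0 followed by the running sum of lengths.
my @offsets = (0);
push @offsets, $offsets[-1] + $_ for @lens;
print "@offsets\n";
# prints: 0 2 4 5

# Members of cluster 1 are the rows between consecutive offsets.
my $c = 1;
my @members = @rows[ $offsets[$c] .. $offsets[$c + 1] - 1 ];
print "@members\n";
# prints: 0 2
```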

attachtonearestd

Signature: (
 double cdistmat(k,n);
 int rowids(nr);
 int [o]clusterids(nr);
 double [o]dists(nr);
 )

Assigns each specified data row to the nearest cluster centroid, as for attachtonearest(), given the datum-to-cluster distance matrix $cdistmat(). Currently just a wrapper for a few PDL calls. In scalar context returns $clusterids(), in list context returns the list ($clusterids(),$dists()).
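The assignment rule amounts to taking, for each requested row, the index of the smallest entry in that row of the distance matrix. A plain-Perl sketch follows; the sample values are invented, and resolving ties in favor of the lowest cluster index is an assumption, not documented behavior.

```perl
use strict;
use warnings;

# cdistmat(k,n) with k=3 clusters and n=3 data rows, stored here as one
# array reference per data row (illustrative values).
my @cdistmat = (
    [0.9, 0.1, 0.5 ],
    [0.2, 0.7, 0.3 ],
    [0.4, 0.4, 0.05],
);
my @rowids = (0, 2);

my (@clusterids, @dists);
for my $r (@rowids) {
    my $row  = $cdistmat[$r];
    my $best = 0;
    for my $c (1 .. $#$row) {
        # Strict comparison keeps the lowest index on ties (assumption).
        $best = $c if $row->[$c] < $row->[$best];
    }
    push @clusterids, $best;
    push @dists,      $row->[$best];
}
print "@clusterids | @dists\n";
# prints: 1 2 | 0.1 0.05
```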

randomprototypes

Signature: (int k; int n; [o]prototypes(k))

Generate a random set of $k prototype indices drawn from $n objects, ensuring that no object is used more than once. Calls checkprototypes().

See also: checkprototypes(), randomassign(), checkpartitions(), randompartition().
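Drawing $k distinct indices from $n objects is sampling without replacement. One common way to do that, shown here in plain Perl with the core List::Util module, is to shuffle the index range and keep the first $k entries. The helper random_prototypes() is hypothetical and is not claimed to match the library's own procedure.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: pick $k distinct indices from 0 .. $n-1.
sub random_prototypes {
    my ($k, $n) = @_;
    die "need 0 <= k <= n\n" unless $k >= 0 && $k <= $n;
    my @order = shuffle(0 .. $n - 1);
    return @order[0 .. $k - 1];
}

my @prototypes = random_prototypes(4, 10);
print "@prototypes\n";   # four distinct indices in 0..9, order varies
```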

randompartition

Signature: (int k; int n; [o]partition(n))

Generate a partitioning of $n objects into $k clusters, ensuring that every cluster contains at least one object. Calls checkpartitions(). This method is identical in functionality to randomassign(), but may be faster if $k is significantly smaller than $n.

See also: randomassign(), checkpartitions(), checkprototypes(), randomprototypes().
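The non-empty-cluster guarantee can be met by first seeding each cluster with one randomly chosen object and then assigning the remaining objects at random. The plain-Perl sketch below uses that approach with a hypothetical helper random_partition(); it is not claimed to be the procedure PDL::Cluster uses, and it does not sample uniformly over all valid partitions.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: partition $n objects into $k non-empty clusters.
sub random_partition {
    my ($k, $n) = @_;
    die "need 1 <= k <= n\n" unless $k >= 1 && $k <= $n;
    my @order = shuffle(0 .. $n - 1);
    my @partition;
    # Seed: the first $k shuffled objects get one cluster each.
    $partition[ $order[$_] ] = $_ for 0 .. $k - 1;
    # Remaining objects are assigned to a uniformly random cluster.
    $partition[ $order[$_] ] = int(rand($k)) for $k .. $n - 1;
    return @partition;
}

my @partition = random_partition(3, 10);
print "@partition\n";    # ten cluster IDs in 0..2, each ID used at least once
```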

COMMON ARGUMENTS

Many of the functions described above require one or more of the following parameters:

d

The number of features defined for each data element.

n

The number of data elements to be clustered.

k

nclusters

The number of desired clusters.

data(d,n)

A matrix representing the data to be clustered, double-valued.

mask(d,n)

A matrix indicating which data values are missing. If mask(i,j) == 0, then data(i,j) is treated as missing.

weights(d)

The (feature-) weights that are used to calculate the distance.

Warning: Not all distance metrics make use of weights; you must provide some nonetheless.

clusterids(n)

A clustering solution. $clusterids() maps data elements (row indices in $data()) to values in the range [0,$k-1].
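To illustrate how the mask(d,n) argument marks missing values, the plain-Perl sketch below computes a per-feature mean that skips entries whose mask is 0. The helper masked_feature_mean() is hypothetical, and skipping masked entries is an assumption about what "treated as missing" implies for an average.

```perl
use strict;
use warnings;

# data(d,n) with d=2 features and n=3 elements, stored here as one array
# reference per element; the mask has the same shape (1=ok, 0=missing).
my @data = ([1.0, 10.0], [3.0, 99.0], [5.0, 30.0]);
my @mask = ([1,   1   ], [1,   0   ], [1,   1   ]);

# Hypothetical helper: mean of feature $j over elements whose mask is 1.
sub masked_feature_mean {
    my ($data, $mask, $j) = @_;
    my ($sum, $count) = (0, 0);
    for my $i (0 .. $#$data) {
        next unless $mask->[$i][$j];   # skip missing values
        $sum += $data->[$i][$j];
        $count++;
    }
    return $count ? $sum / $count : undef;
}

my $mean0 = masked_feature_mean(\@data, \@mask, 0);   # (1+3+5)/3
my $mean1 = masked_feature_mean(\@data, \@mask, 1);   # (10+30)/2, 99 is masked
print "$mean0 $mean1\n";
# prints: 3 20
```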

Distance Metrics

Distances between data elements (and cluster centroids, where applicable) are computed using one of a number of built-in metrics. The metric to use for a given computation is indicated by a character flag, denoted above as $distFlag. In the following, w[i] represents the weight of feature i in $weights(), and W represents the sum of all weights.

Currently implemented distance metrics and the corresponding flags are:

e

Pseudo-Euclidean distance:

dist_e(x,y) = 1/W * sum_{i=1..d} w[i] * (x[i] - y[i])^2

Note that this is not the "true" Euclidean distance, which is defined as:

dist_E(x,y) = sqrt( sum_{i=1..d} (x[i] - y[i])^2 )

b

City-block ("Manhattan") distance:

dist_b(x,y) = 1/W * sum_{i=1..d} w[i] * |x[i] - y[i]|

c

Pearson correlation distance:

dist_c(x,y) = 1-r(x,y)

where r is the Pearson correlation coefficient:

r(x,y) = 1/d * sum_{i=1..d} (x[i]-mean(x))/stddev(x) * (y[i]-mean(y))/stddev(y)

a

Absolute value of the correlation,

dist_a(x,y) = 1-|r(x,y)|

where r(x,y) is the Pearson correlation coefficient.

u

Uncentered correlation (cosine of the angle):

dist_u(x,y) = 1-r_u(x,y)

where:

r_u(x,y) = 1/d * sum_{i=1..d} (x[i]/sigma0(x)) * (y[i]/sigma0(y))

and:

sigma0(w) = sqrt( 1/d * sum_{i=1..d} w[i]^2 )

x

Absolute uncentered correlation,

dist_x(x,y) = 1-|r_u(x,y)|

s

Spearman's rank correlation.

dist_s(x,y) = 1-r_s(x,y) ~= dist_c(ranks(x),ranks(y))

where r_s(x,y) is the Spearman rank correlation. Weights are ignored.

k

Kendall's tau (does not use weights).

dist_k(x,y) = 1 - tau(x,y)

(other values)

For other values of $distFlag, the default (pseudo-Euclidean distance, "e") is used.
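The formulas for the "e", "b", and "c" metrics can be checked with a short plain-Perl sketch that transcribes them directly. The helper names are hypothetical, missing-data masks are ignored, and the Pearson helper follows the unweighted formula exactly as written above; none of this is claimed to mirror the C library's internals.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# dist_e: weighted mean of squared differences (no square root).
sub dist_e {
    my ($x, $y, $w) = @_;
    my $total = 0;
    $total += $w->[$_] * ($x->[$_] - $y->[$_]) ** 2 for 0 .. $#$x;
    return $total / sum(@$w);
}

# dist_b: weighted mean of absolute differences.
sub dist_b {
    my ($x, $y, $w) = @_;
    my $total = 0;
    $total += $w->[$_] * abs($x->[$_] - $y->[$_]) for 0 .. $#$x;
    return $total / sum(@$w);
}

# dist_c: 1 - r(x,y), with r computed from population standard deviations.
sub dist_c {
    my ($x, $y) = @_;
    my $d = scalar @$x;
    my $mean_x = sum(@$x) / $d;
    my $mean_y = sum(@$y) / $d;
    my ($var_x, $var_y, $cov) = (0, 0, 0);
    for my $i (0 .. $d - 1) {
        $var_x += ($x->[$i] - $mean_x) ** 2;
        $var_y += ($y->[$i] - $mean_y) ** 2;
        $cov   += ($x->[$i] - $mean_x) * ($y->[$i] - $mean_y);
    }
    my $r = ($cov / $d) / (sqrt($var_x / $d) * sqrt($var_y / $d));
    return 1 - $r;
}

my @x = (1, 2, 3);
my @y = (2, 4, 6);
my @w = (1, 1, 2);

my $de = dist_e(\@x, \@y, \@w);   # (1*1 + 1*4 + 2*9) / 4 = 5.75
my $db = dist_b(\@x, \@y, \@w);   # (1*1 + 1*2 + 2*3) / 4 = 2.25
my $dc = dist_c(\@x, \@y);        # y = 2*x, so r is 1 and dist_c is about 0
printf "%.2f %.2f %.2f\n", $de, $db, abs($dc);
# prints: 5.75 2.25 0.00
```

Note that dist_e for these vectors is 5.75, whereas the true Euclidean distance would be sqrt(1 + 4 + 9), roughly 3.74.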

Link Methods

For hierarchical clustering, the 'link method' must be specified by a character flag, denoted above as $methodFlag. Known link methods are:

s

Pairwise minimum-linkage ("single") clustering.

Defines the distance between two clusters as the least distance between any two of their respective elements.

m

Pairwise maximum-linkage ("complete") clustering.

Defines the distance between two clusters as the greatest distance between any two of their respective elements.

a

Pairwise average-linkage clustering (centroid distance using arithmetic mean).

Defines the distance between two clusters as the distance between their respective centroids, where each cluster centroid is defined as the arithmetic mean of that cluster's elements.

c

Pairwise centroid-linkage clustering (centroid distance using median).

Identifies the distance between two clusters as the distance between their respective centroids, where each cluster centroid is computed as the median of that cluster's elements.

(other values)

Behavior for other values is currently undefined.

For the first three, either the distance matrix or the gene expression data is sufficient to perform the clustering algorithm. For pairwise centroid-linkage clustering, however, the gene expression data are always needed, even if the distance matrix itself is available.
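The link-method definitions above can be compared on a toy example. The plain-Perl sketch below uses two clusters of one-dimensional points with absolute difference as the element distance; both the data and that choice of distance are assumptions made only for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Two toy clusters of one-dimensional points.
my @A = (1, 2);
my @B = (5, 9);

# All pairwise distances between members of the two clusters: 4 8 3 7.
my @pairwise = map { my $a1 = $_; map { abs($a1 - $_) } @B } @A;

my $single   = min(@pairwise);   # "s": least pairwise distance
my $complete = max(@pairwise);   # "m": greatest pairwise distance

# "a" as defined above: distance between the arithmetic-mean centroids.
my $centroid = abs(sum(@A) / scalar(@A) - sum(@B) / scalar(@B));

print "$single $complete $centroid\n";
# prints: 3 8 5.5
```

The "s" and "m" values depend only on pairwise distances, while the centroid-based value needs the original data points, which matches the remark that centroid computations cannot work from the distance matrix alone.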

ACKNOWLEDGEMENTS

Perl by Larry Wall.

PDL by Karl Glazebrook, Tuomas J. Lukka, Christian Soeller, and others.

C Clustering Library by Michiel de Hoon, Seiya Imoto, and Satoru Miyano.

Original Algorithm::Cluster module by John Nolan and Michiel de Hoon.

KNOWN BUGS

Dimensional requirements are sometimes too strict.

Passing weights to the Spearman and Kendall distance metrics wastes space, since those metrics ignore weights.

AUTHOR

Bryan Jurish <moocow@cpan.org> wrote and maintains the PDL::Cluster distribution.

Michiel de Hoon wrote the underlying C clustering library for cDNA microarray data.

COPYRIGHT

PDL::Cluster is a set of wrappers around the C Clustering library for cDNA microarray data.

The C clustering library for cDNA microarray data. Copyright (C) 2002-2005 Michiel Jan Laurens de Hoon.

This library was written at the Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan. Contact: michiel.dehoon 'AT' riken.jp

See the files README.cluster, cluster.c and cluster.h in the PDL::Cluster distribution for details.

PDL::Cluster wrappers copyright (C) Bryan Jurish 2005-2018. All rights reserved. This package is free software, and entirely without warranty. You may redistribute it and/or modify it under the same terms as Perl itself.
