NAME

Bio::FdrFet - Perl extension for False Discovery Rate and Fisher Exact Test applied to pathway analysis.

SYNOPSIS

  use Bio::FdrFet;
  my $obj = new Bio::FdrFet($fdrcutoff);

  open (IN, $pathwayfile) ||
      die "can not open pathway annotation file $pathwayfile: $!\n";
  while (<IN>) {
      chomp;
      my ($gene, $dbacc, $desc, $rest) = split (/\t/, $_, 4);
      $obj->add_to_pathway("gene"  => $gene,
                           "dbacc" => $dbacc,
                           "desc"  => $desc);
  }
  close IN;

  #read in genes and associated p values
  my (%genes, @fdrs);
  open (IN, $genefile) || die "can not open gene file $genefile: $!\n";
  my $ref_size = 0;
  while (<IN>) {
      chomp;
      my ($gene, $pval) = split (/\t/, $_);
      $obj->add_to_genes("gene" => $gene,
                         "pval" => $pval);
      $ref_size++;
  }
  close IN;
  $obj->gene_count($ref_size);
  $obj->calculate;
  foreach my $pathway ($obj->pathways('sorted')) {
      my $logpval = $obj->pathway_result($pathway, 'LOGPVAL');
      printf "Pathway %s %s has -log10(pval) = %6.4f\n",
             $pathway,
             $obj->pathway_desc($pathway),
             $logpval;
  }

Constructor

$obj = new Bio::FdrFet($fdrcutoff);

# You can also use $obj = Bio::FdrFet->new($fdrcutoff);

Object Methods

Input Methods

$obj->fdr_cutoff($new_cutoff);
$obj->universe($universe_option);
$obj->verbose($new_verbose);
$obj->add_to_pathway("gene" => <gene_name>,
                     "dbacc" => <pathway_accession>,
                     "desc" => <pathway description>);
$obj->add_to_genes("gene" => <gene_name>,
                   "pval" => <probability_value>);

Output Methods

$obj->genes;
$obj->pathways[($order)];
$obj->pathway_result($pathway, $data_name[, $all_flag]);
$obj->pathway_desc($pathway);
$obj->pathway_genes($pathway);
$obj->fdr_position($fdr);

Other Methods

$obj->gene_count[($fet_gene_count)];
$obj->calculate;

DESCRIPTION

Bio::FdrFet implements the False Discovery Rate Fisher Exact Test of gene expression analysis applied to pathways described in the paper by Ruiru Ji, Karl-Heinz Ott, Roumyana Yordanova, and Robert E Bruccoleri. A copy of the paper is included with the distribution in the file, Fdr-Fet-Manuscript.pdf.

The module is implemented using a simple object oriented paradigm where the object stores all the information needed for a calculation along with a state variable, STATE. The state variable has two possible values, 'setup' and 'calculated'. The value 'setup' indicates that the object is being set up with data, and any results in the object are inconsistent with the data. The value 'calculated' indicates that the object's computational results are consistent with the data, and may be returned to a calling program.

The 'calculate' method is used to update all the calculated values from the input data. It checks the state variable first, and only does the calculation if the state is 'setup'. Once the calculations are complete, the state variable is set to 'calculated'. Thus, the calculate method can be called whenever a calculated value is needed; if the results are already current, the call returns immediately without recalculating.

The module initializes the Bio::FdrFet object with a state of 'setup'. Any data input sets the state to 'setup'. Any request for calculated data calls 'calculate', which updates the state variable so that future requests for calculated data return quickly.
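
The setup/calculated cycle described above can be sketched as follows. This is a minimal illustration, not the module's source: the package name LazyDemo, its add_value and total methods, and the RUNS counter are hypothetical. Only the STATE key and its two values come from the description.

```perl
use strict;
use warnings;

package LazyDemo;    # hypothetical class, not part of Bio::FdrFet

sub new {
    my ($class) = @_;
    # A new object starts in the 'setup' state with no results.
    return bless { STATE => 'setup', VALUES => [], TOTAL => undef, RUNS => 0 }, $class;
}

# Any data input invalidates earlier results by resetting the state.
sub add_value {
    my ($self, $value) = @_;
    push @{ $self->{VALUES} }, $value;
    $self->{STATE} = 'setup';
    return;
}

# Recompute only when the state says the results are stale.
sub calculate {
    my ($self) = @_;
    return if $self->{STATE} eq 'calculated';
    my $total = 0;
    $total += $_ for @{ $self->{VALUES} };
    $self->{TOTAL} = $total;
    $self->{RUNS}++;                 # counts real recalculations only
    $self->{STATE} = 'calculated';
    return;
}

# An output method calls calculate first, so callers never see stale data.
sub total {
    my ($self) = @_;
    $self->calculate;
    return $self->{TOTAL};
}

package main;

my $demo = LazyDemo->new;
$demo->add_value($_) for 1 .. 3;
my $first  = $demo->total;    # triggers the calculation
my $second = $demo->total;    # served from the stored result
$demo->add_value(4);          # back to 'setup'
my $third  = $demo->total;    # recalculated
```

In this sketch the second call to total does no work because the state is already 'calculated'; adding a value afterwards forces exactly one more recalculation.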

METHODS

The following methods are provided:

new([$fdrcutoff])

Creates a new Bio::FdrFet object. The optional parameter is the False Discovery Rate cutoff in units of percent. See the fdr_cutoff method below for more details.

fdr_cutoff([$fdrcutoff])

Retrieves the current setting for the False Discovery Rate threshold, and optionally sets it. This threshold must be an integer greater than 0 and less than or equal to 100, and is divided by 100 for the value used by the computation.
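
Several methods here share the "retrieve, and optionally set" form. A minimal sketch of that accessor style, with the range check and percent conversion described for fdr_cutoff, might look like the following. The package CutoffDemo, the FDR_CUTOFF key, the fdr_fraction method, the placeholder default, and the choice to die on bad input are all assumptions made for illustration; the module's own behavior may differ.

```perl
use strict;
use warnings;

package CutoffDemo;    # hypothetical class, not part of Bio::FdrFet

sub new {
    my ($class, $cutoff) = @_;
    # 35 is a placeholder default chosen for this sketch only.
    my $self = bless { FDR_CUTOFF => 35 }, $class;
    $self->fdr_cutoff($cutoff) if defined $cutoff;
    return $self;
}

# Combined accessor: sets the value when an argument is given, then returns it.
sub fdr_cutoff {
    my ($self, $cutoff) = @_;
    if (defined $cutoff) {
        die "FDR cutoff must be an integer from 1 to 100\n"
            unless $cutoff =~ /^\d+$/ && $cutoff >= 1 && $cutoff <= 100;
        $self->{FDR_CUTOFF} = $cutoff;
    }
    return $self->{FDR_CUTOFF};
}

# The percent value is divided by 100 before it is used in a computation.
sub fdr_fraction {
    my ($self) = @_;
    return $self->{FDR_CUTOFF} / 100;
}

package main;

my $demo     = CutoffDemo->new(20);
my $current  = $demo->fdr_cutoff;        # read only
my $updated  = $demo->fdr_cutoff(5);     # set, then read
my $fraction = $demo->fdr_fraction;      # 5 percent as a fraction
my $rejected = !eval { $demo->fdr_cutoff(0); 1 };   # true: 0 is out of range
```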

verbose([$verbose_mode])

Retrieves the current setting for the verbose parameter and optionally sets it. It can be either 0, no verbosity, or 1, lots of messages sent to STDERR.

universe([$universe_option])

Retrieves the current setting for the universe option and optionally sets it. The universe option specifies how the number of genes in our statistical universe is calculated. There are four possible settings to this option:

union

The universe is the union of all gene names specified individually and in pathways. Genes which have no P value are counted in the universe, but they are not counted as regulated in the FET or FDR calculations.

genes

Only genes specified by the add_to_genes method count.

intersection

Only genes in the intersection of the gene list and pathways are used for the universe calculation.

user

The user specifies the universe size by calling the gene_count method with an argument.
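
To make the four settings concrete, the sketch below counts the universe for a tiny data set using plain hashes. It is an illustration of the definitions above, not the module's code: the helper universe_size, its argument layout, and the sample gene and pathway names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the universe size for one option.
# %$pvals maps gene name => p-value; %$pathways maps accession => [gene names].
sub universe_size {
    my ($option, $pvals, $pathways, $user_count) = @_;

    # Collect every gene name that appears in at least one pathway.
    my %in_pathway;
    for my $genes (values %$pathways) {
        $in_pathway{$_} = 1 for @$genes;
    }

    if ($option eq 'union') {
        # Genes named individually plus genes named in pathways.
        my %all = %in_pathway;
        $all{$_} = 1 for keys %$pvals;
        return scalar keys %all;
    }
    if ($option eq 'genes') {
        return scalar keys %$pvals;
    }
    if ($option eq 'intersection') {
        return scalar(grep { exists $in_pathway{$_} } keys %$pvals);
    }
    if ($option eq 'user') {
        return $user_count;
    }
    die "unknown universe option '$option'\n";
}

my %pvals    = (geneA => 0.001, geneB => 0.20, geneC => 0.04);
my %pathways = (P1 => [qw(geneA geneB)], P2 => [qw(geneB geneD)]);

my %size = map { $_ => universe_size($_, \%pvals, \%pathways, 20000) }
           qw(union genes intersection user);
```

Here geneD appears only in a pathway and geneC only in the gene list, so the union holds four genes, the gene list three, and the intersection two.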

add_to_pathway(
"gene" => <gene_name>,
"dbacc" => <pathway_accession>,
"desc" => <pathway description>)

Adds a gene to a pathway and also defines a pathway. The arguments are passed as a list of name/value pairs, with each value preceded by its name.

Pathways are defined by an accession key (dbacc parameter), a description (desc parameter), and a set of genes (specified by the gene parameter). To specify a pathway with multiple genes, call this method multiple times with the same accession key and description, varying the gene name. The gene names are arbitrary strings, but they must match the names passed with Probability Values (pvalues) to the add_to_genes method.

add_to_genes(
"gene" => <gene_name>,
"pval" => <probability value>)

Adds a probability value for a gene in the calculation. The arguments are passed as a list of name/value pairs, with the name of each parameter preceding its value. The gene names must match those used in the pathways. The probability values are estimates of non-randomness and should range from 0 to 1.
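
Because the p-values are expected to lie between 0 and 1, it can help to check each input line before handing it to add_to_genes. The following sketch shows one way to do that for a tab-separated "gene, p-value" line. The helper parse_gene_line and the sample gene names are hypothetical, and whether the module performs any such validation itself is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: splits one "gene<TAB>pval" line and validates the p-value.
# Returns (gene, pval), or an empty list when the line should be skipped.
sub parse_gene_line {
    my ($line) = @_;
    chomp $line;    # drop the newline so it does not end up in the p-value
    my ($gene, $pval) = split /\t/, $line;
    return unless defined $gene && length $gene && defined $pval;
    # Accept plain decimal or exponent notation only.
    return unless $pval =~ /^(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?$/;
    # Enforce the documented 0 to 1 range.
    return unless $pval >= 0 && $pval <= 1;
    return ($gene, $pval + 0);
}

my @ok       = parse_gene_line("BRCA1\t0.0042\n");
my @sci      = parse_gene_line("TP53\t3.5e-06\n");
my @too_big  = parse_gene_line("MYC\t1.7\n");     # out of range, skipped
my @no_value = parse_gene_line("EGFR\n");         # missing p-value, skipped
```

A caller would pass the returned pair on as add_to_genes("gene" => $gene, "pval" => $pval) and skip lines for which the list is empty.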

genes()

Returns the list of gene names in the system.

pathways[($order)]

Returns the list of pathways. If the optional argument $order is specified and contains the word "sorted" (the comparison is case insensitive), the object returns the pathways ordered from most significant to least. Sorting requires the probability values, so the object updates the calculation first; if no sorting is requested, no calculation is done.

pathway_result($pathway, $data_name[, $all_flag])

Returns a calculated result for a pathway. The following values may be used for $data_name. Case of $data_name does not matter.

LOGPVAL   -log10(probability value for pathway).
PVAL      Probability value for pathway.
ODDS      Odds ratio.
Q         Number of genes in the pathway passing the FDR cutoff.
M         Number of genes overall passing the FDR cutoff.
N         Number of genes in the system minus M above.
K         Number of genes in the pathway.
FDR       FDR cutoff in percent giving the best pvalue.
LOCI      Reference to an array of gene names in the pathway
          that satisfy FDR cutoff.

If $all_flag is specified and has the value "all", then this returns an array of values for all the attempted FDR cutoffs, except for the FDR cutoff.
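
The counts Q, M, N and K are the inputs of a hypergeometric (Fisher Exact Test) calculation. As a rough sketch of how such counts turn into PVAL and LOGPVAL, the code below assumes the p-value is the one-sided upper tail, that is, the probability of seeing at least Q regulated genes when K genes are drawn from M regulated and N non-regulated genes. That choice of tail is an assumption, as are the helper names; the paper included with the distribution defines the computation the module actually performs.

```perl
use strict;
use warnings;

# Natural log of n!, computed by summation (adequate for illustrative counts).
sub log_factorial {
    my ($n) = @_;
    my $sum = 0;
    $sum += log($_) for 2 .. $n;
    return $sum;
}

# Natural log of the binomial coefficient C(n, k).
sub log_choose {
    my ($n, $k) = @_;
    return log_factorial($n) - log_factorial($k) - log_factorial($n - $k);
}

# Upper tail P(X >= q) when k genes are drawn from m regulated and n other genes.
sub hypergeom_upper_tail {
    my ($q, $m, $n, $k) = @_;
    my $max       = $k < $m ? $k : $m;
    my $log_denom = log_choose($m + $n, $k);
    my $p         = 0;
    for my $i ($q .. $max) {
        next if $k - $i > $n;    # cannot draw more non-regulated genes than exist
        $p += exp(log_choose($m, $i) + log_choose($n, $k - $i) - $log_denom);
    }
    return $p;
}

# Example counts: Q = 3, M = 5, N = 15, K = 4.
my $pval    = hypergeom_upper_tail(3, 5, 15, 4);
my $logpval = -log($pval) / log(10);    # Perl has no built-in log10
```

For these example counts the tail sums the cases of 3 and 4 regulated genes in the pathway, which is (150 + 5) / 4845.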

pathway_desc($pathway)

Returns the description field of the specified pathway.

pathway_genes($pathway)

Returns an array containing the genes of the specified pathway.

fdr_position($fdr)

Returns the position in the gene list for a specific FDR value. The $fdr variable must be an integer between 1 and the FDR cutoff.
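
One common way to turn an FDR value into a position in a p-value-sorted gene list is the Benjamini-Hochberg step-up rule. The sketch below uses that rule purely as an illustration; it is an assumption that it resembles the module's procedure, and the helper bh_position and the sample p-values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: number of top-ranked genes accepted at a given FDR (in percent),
# using the Benjamini-Hochberg rule p(i) <= (i / n) * fdr on ascending p-values.
sub bh_position {
    my ($fdr_percent, @pvals) = @_;
    my @sorted   = sort { $a <=> $b } @pvals;
    my $n        = scalar @sorted;
    my $fraction = $fdr_percent / 100;
    my $position = 0;
    for my $i (1 .. $n) {
        # Keep the largest rank that satisfies the rule; an earlier failure does not stop the scan.
        $position = $i if $sorted[$i - 1] <= ($i / $n) * $fraction;
    }
    return $position;
}

my @pvals = (0.041, 0.001, 0.216, 0.008, 0.039, 0.042, 0.205, 0.074, 0.212, 0.060);
my $at_5  = bh_position(5,  @pvals);    # strict cutoff keeps few genes
my $at_35 = bh_position(35, @pvals);    # loose cutoff keeps many
```

With these ten sample values, only the two smallest p-values pass at 5 percent, while all ten pass at 35 percent.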

gene_count([$fet_gene_count])

Returns the count of genes in the system, which is the size of the union of the gene names seen from both the add_to_genes and add_to_pathway methods. This value is used in the Fisher Exact Test calculation. You can change the total gene count value used in the calculation by specifying a parameter to this method.

calculate()

Runs the FDR FET calculation.

EXPORT

None by default.

SEE ALSO

The FDR FET paper included with the source code.

AUTHORS

Robert Bruccoleri <bruc@acm.org>
Ruiru Ji <ruiru.ji@bms.com>
Karl-Heinz Ott <karl-heinz.ott@bms.com>
Roumyana Yordanova <roumyana.yordanova@bms.com>

ACKNOWLEDGEMENT

We thank Douglas B. Craig, Division of Clinical Pharmacology and Toxicology, Children's Hospital of Michigan, Detroit, MI for finding and correcting a bug in the FDR implementation.

COPYRIGHT AND LICENSE

Copyright (C) 2009 by Bristol-Myers Squibb Company and Congenomics LLC.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.