NAME

Data::Pareto - Computing Pareto sets in Perl

VERSION

Version 0.04

SYNOPSIS

use Data::Pareto;

# only first and third columns are used in comparison
# the others are simply descriptive
my $set = new Data::Pareto( { columns => [0, 2] } );
$set->add(
    [ 5, "pareto", 10, 11 ],
    [ 5, "dominated", 11, 9 ],
    [ 4, "pareto2", 12, 12 ] 
);

# this returns [ [ 5, "pareto", 10, 11 ], [ 4, "pareto2", 12, 12 ] ],
# the other one is dominated on selected columns
$set->get_pareto_ref;

DESCRIPTION

This module makes calculation of Pareto set. Given a set of vectors (i.e. arrays of simple scalars), Pareto set is all the vectors from the given set which are not dominated by any other vector of the set. A vector X is said to be dominated by Y, iff X[i] >= Y[i] for all i and X[i] > Y[i] for at least one i.

Pareto sets play an important role in multiobjective optimization, where each non-dominated (i.e. Pareto) vector describes objectives value of "optimal" solution to the given problem.

This module allows occurrence of duplicates in the set - this makes it rather a bag than a set, but is useful in practice (e.g. when we want to preserve two solutions giving the same objectives value, but structurally different). This assumption influences dominance definition given above: two duplicates never dominate each other and hence can be present in the Pareto set. This is controlled by duplicates option passed to new(): if set to true value, duplicates are allowed in Pareto set; otherwise, only the first found element of the subset of duplicated vectors is preserved in Pareto set.

The values are allowed to be invalid. The meaning of 'invalid' is 'the worst possible'. It's different concept than 'unknown'; unknown value make the definition of domination less clear.

By default, the comparison of column values is numerical and the smaller value dominates the larger one. If you want to override this behaviour, pass your own dominator sub in arguments to new().

FUNCTIONS

By default, a vector is passed around as a ref to array of consecutive column values. This means you shouldn't mess with it after passing to add method.

new

Creates a new object for calculating Pareto set.

The first argument passed is a hashref with options; the recognized options are:

  • columns

    Arrayref containing column numbers which should be used for determining domination and duplication. Column numbers are 0-based array indexes to data vectors.

    Only values at those positions will be ever compared between vectors. Any other data in the vectors may be present and is not used in any way.

    At least one column number should be passed, for obvious reasons.

  • duplicates

    If set to true value, duplicated vectors are all put in Pareto set (if they are Pareto, of course). If set to false, duplicates of vectors already in the Pareto set are discarded.

  • invalid

    The value considered invalid in pareto set. Such value is dominated by any value and dominates only invalid value.

    However, computations of domination in presence of invalid values can be considerably slower, as much as 5 times. So it probably will be faster to first parse the data and replace invalid markers with some huge-and-surely-dominated values.

  • column_dominator

    The sub(s) used to compare specific column values and determining domination between them. Scalar, sub ref or hash ref. If not set, the default is that the numerically smaller value dominates the other one.

    When the scalar is passed, it is assumed to be the name of a predefined dominator. This is a much faster option to specifying the sub of your own. Recognized dominators are:

    • min numerically smaller value dominates

    • max numerically greater value dominates

    • lexi earlier in collation order value dominates (lexicographical order)

    • lexi_rev later in collation order value dominates (reversed lexicographical order)

    • std standard, i.e. min dominator

    During creation of Pareto set, the dominator sub is called with three arguments: column number, first vector's value, second vector's value, and should return true, when the second value dominates the first one, assuming they appeared in the specified column.

    Make sure that your sub returns true when two passed values are the same. This is necessary to obey the whole Pareto set domination contract.

    There are two approaches possible when the values in different columns are of different types, in the sense of domination. First, you can use passed column number to decide the domination check function. Alternatively, you can pass a hash ref with mapping from the column number to the sub ref used to compare the given column:

    my $lexi_dominator = sub {
        my ($col, $dominated, $by) = @_;
        return ($dominated ge $by);
    };
    my $min_dominator = sub {
        my ($col, $dominated, $by) = @_;
        return ($dominated >= $by);
    }
    
    my $set = new Data::Pareto({
    	  columns => [0, 2],
    	  column_dominator => {
    	  	  0 => $lexi_dominator,
    	  	  2 => $min_dominator
    	  }
    });
    $set->add(['a', 'label 1', 12], ['b', 'label 2', 9]);

The rest of arguments are assumed to be vectors, and passed to add() method.

add

Tests vectors passed as arguments and adds the non-dominated ones to the Pareto set.

get_pareto

Returns the current content of Pareto set as a list of vectors.

get_pareto_ref

Returns the current content of Pareto set as a ref to array with vectors. The return value references the original array, so treat it as read-only!

is_dominated

Checks if the first vector passed is dominated by the second one. The comparison is made based on the values in vectors' columns, which were passed to new().

The vectors passed are never duplicates of each other when this method is called from inside this module.

Returns true, when the first vector from arguments list is dominated by the other one, and false otherwise.

is_invalid

Checks if the given value is considered invalid for the current object. Every value is valid by default.

TODO

Allow specifying built-in dominators inside dominator hash.

For large data sets calculations become time-intensive. There are a couple of techniques which might be applied to improve the performance:

  • defer the phase of removing vectors dominated by newly added vectors to get_pareto() call; this results in smaller number of arrays rewritings.

  • split the set of vectors being added into smaller subsets, calculate Pareto sets for such subsets, and then apply insertion of resulting Pareto subsets to the main set; this results in smaller number of useless tries of adding dominated vectors into the set.

AUTHOR

Przemyslaw Wesolek, <jest at go.art.pl>

BUGS

Please report any bugs or feature requests to bug-data-pareto at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Data-Pareto. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Data::Pareto

You can also look for information at:

COPYRIGHT & LICENSE

Copyright 2009 Przemyslaw Wesolek, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.