NAME
Data::Pareto - Computing Pareto sets in Perl
VERSION
Version 0.05
SYNOPSIS
use Data::Pareto;
# only first and third columns are used in comparison
# the others are simply descriptive
my $set = new Data::Pareto( { columns => [0, 2] } );
$set->add(
[ 5, "pareto", 10, 11 ],
[ 5, "dominated", 11, 9 ],
[ 4, "pareto2", 12, 12 ]
);
# this returns [ [ 5, "pareto", 10, 11 ], [ 4, "pareto2", 12, 12 ] ],
# the other one is dominated on selected columns
$set->get_pareto_ref;
DESCRIPTION
This module makes calculation of Pareto set. Given a set of vectors (i.e. arrays of simple scalars), Pareto set is all the vectors from the given set which are not dominated by any other vector of the set. A vector X
is said to be dominated by Y
, iff X[i] >= Y[i]
for all i
and X[i] > Y[i]
for at least one i
.
Pareto sets play an important role in multiobjective optimization, where each non-dominated (i.e. Pareto) vector describes objectives value of "optimal" solution to the given problem.
This module allows occurrence of duplicates in the set - this makes it rather a bag than a set, but is useful in practice (e.g. when we want to preserve two solutions giving the same objectives value, but structurally different). This assumption influences dominance definition given above: two duplicates never dominate each other and hence can be present in the Pareto set. This is controlled by duplicates
option passed to new(): if set to true
value, duplicates are allowed in Pareto set; otherwise, only the first found element of the subset of duplicated vectors is preserved in Pareto set.
The values are allowed to be invalid. The meaning of 'invalid' is 'the worst possible'. It's different concept than 'unknown'; unknown value make the definition of domination less clear.
By default, the comparison of column values is numerical and the smaller value dominates the larger one. If you want to override this behaviour, pass your own dominator sub in arguments to new().
FUNCTIONS
By default, a vector is passed around as a ref to array of consecutive column values. This means you shouldn't mess with it after passing to add
method.
new
Creates a new object for calculating Pareto set.
The first argument passed is a hashref with options; the recognized options are:
columns
Arrayref containing column numbers which should be used for determining domination and duplication. Column numbers are
0
-based array indexes to data vectors.Only values at those positions will be ever compared between vectors. Any other data in the vectors may be present and is not used in any way.
At least one column number should be passed, for obvious reasons.
duplicates
If set to
true
value, duplicated vectors are all put in Pareto set (if they are Pareto, of course). If set tofalse
, duplicates of vectors already in the Pareto set are discarded.invalid
The value considered invalid in pareto set. Such value is dominated by any value and dominates only invalid value.
However, computations of domination in presence of invalid values can be considerably slower, as much as 5 times. So it probably will be faster to first parse the data and replace invalid markers with some huge-and-surely-dominated values.
column_dominator
The sub(s) used to compare specific column values and determining domination between them. Scalar, sub ref or hash ref. If not set, the default is that the numerically smaller value dominates the other one.
When the scalar is passed, it is assumed to be the name of a predefined dominator. This is a much faster option to specifying the sub of your own. Recognized dominators are:
min
numerically smaller value dominatesmax
numerically greater value dominateslexi
earlier in collation order value dominates (lexicographical order)lexi_rev
later in collation order value dominates (reversed lexicographical order)std
standard, i.e.min
dominator
During creation of Pareto set, the dominator sub is called with three arguments: column number, first vector's value, second vector's value, and should return
true
, when the second value dominates the first one, assuming they appeared in the specified column.Make sure that your sub returns
true
when two passed values are the same. This is necessary to obey the whole Pareto set domination contract.There are two approaches possible when the values in different columns are of different types, in the sense of domination. First, you can use passed column number to decide the domination check function. Alternatively, you can pass a hash ref with mapping from the column number to the sub ref used to compare the given column:
my $lexi_dominator = sub { my ($col, $dominated, $by) = @_; return ($dominated ge $by); }; my $min_dominator = sub { my ($col, $dominated, $by) = @_; return ($dominated >= $by); } my $set = new Data::Pareto({ columns => [0, 2], column_dominator => { 0 => $lexi_dominator, 2 => $min_dominator } }); $set->add(['a', 'label 1', 12], ['b', 'label 2', 9]);
The rest of arguments are assumed to be vectors, and passed to add() method.
add
Tests vectors passed as arguments and adds the non-dominated ones to the Pareto set.
get_pareto
Returns the current content of Pareto set as a list of vectors.
get_pareto_ref
Returns the current content of Pareto set as a ref to array with vectors. The return value references the original array, so treat it as read-only!
is_dominated
Checks if the first vector passed is dominated by the second one. The comparison is made based on the values in vectors' columns, which were passed to new().
The vectors passed are never duplicates of each other when this method is called from inside this module.
Returns true
, when the first vector from arguments list is dominated by the other one, and false
otherwise.
is_invalid
Checks if the given value is considered invalid for the current object. Every value is valid by default.
TODO
Allow specifying built-in dominators inside dominator hash.
For large data sets calculations become time-intensive. There are a couple of techniques which might be applied to improve the performance:
defer the phase of removing vectors dominated by newly added vectors to get_pareto() call; this results in smaller number of arrays rewritings.
split the set of vectors being added into smaller subsets, calculate Pareto sets for such subsets, and then apply insertion of resulting Pareto subsets to the main set; this results in smaller number of useless tries of adding dominated vectors into the set.
AUTHOR
Przemyslaw Wesolek, <jest at go.art.pl>
BUGS
Please report any bugs or feature requests to bug-data-pareto at rt.cpan.org
, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Data-Pareto. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Data::Pareto
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
COPYRIGHT & LICENSE
Copyright 2009 Przemyslaw Wesolek
This program is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0. For details, see the full text of the license in the file LICENSE.