NAME
Algorithm::DBSCAN - (ALFA code) Perl implementation of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm
SYNOPSIS
This module can be used to find clusters of points in a multidimensional space. More information can be found on Wikipedia: DBSCAN
The simple usage:
use Algorithm::DBSCAN;
my $points_data_file =
'point_1 56.514307478581514 37.146118456702034
point_2 34.02049221667614 46.024651786417536
point_3 23.473087508078684 60.62328221968349
point_4 10.418513808840482 24.59808378533684
point_5 10.583414831970764 25.902459835735534
point_6 9.756855426925464 24.062840099892146
point_7 10.567067873860672 22.32511341184489
point_8 11.070046359352189 25.91278382647844
point_9 9.537780590838175 25.000630928726288
point_10 10.507367338512058 27.637356924097915
point_11 11.949089580614444 30.67843911922257
point_12 10.373548645248105 25.699863108892945
point_13 47.061169019689615 12.482585189174058
point_14 47.00269836645959 12.04880276389404
point_15 47.197663384856476 12.899232975457025
point_16 44.3719178488551 15.41709269630616
point_17 46.31921200316786 12.556849509965417
point_18 44.128763621333135 14.657970021594974
point_19 48.89953587475758 15.183892607591467
point_20 52.15333345222132 16.354597634497154
point_21 50.03978361242539 14.85901473647285';
my $dataset = Algorithm::DBSCAN::Dataset->new();
my @lines = split(/\n\s+/, $points_data_file);
foreach my $line (@lines) {
$dataset->AddPoint(new Algorithm::DBSCAN::Point(split(/\s+/, $line)));
}
my $dbscan = Algorithm::DBSCAN->new($dataset, 4 * 4, 2);
$dbscan->FindClusters();
$dbscan->PrintClustersShort();
If you have huge datasets and want to use multiple CPUs in a optimal way you can build the region index with an external tool (will soon be available). En axample of code that uses a region index would be as follow.
Given the dataset:
point_1 56 37
point_2 34 46
point_3 23 60
point_4 10 24
point_5 10 25
point_6 9 24
point_7 10 22
point_8 11 25
point_9 9 25
point_10 10 27
point_11 11 30
point_12 10 25
point_13 47 12
point_14 47 12
point_15 47 12
point_16 44 15
point_17 46 12
point_18 44 14
point_19 48 15
point_20 52 16
point_21 50 14
The region index with $eps = 4 x 4 and $min_distance = 2 would look like this:
0 0
1 1
2 2
3 3 4 5 6 7 8 9 11
4 3 4 5 6 7 8 9 11
5 3 4 5 6 7 8 9 11
10 9 10
12 12 13 14 16 17 18 20
11 3 4 5 6 7 8 9 11
13 12 13 14 16 17 18 20
14 12 13 14 16 17 18 20
15 15 16 17
16 12 13 14 15 16 17 18
18 12 13 14 16 18 20
7 3 4 5 6 7 8 9 11
20 12 13 14 18 19 20
6 3 4 5 6 7 8 11
17 12 13 14 15 16 17
19 19 20
8 3 4 5 6 7 8 9 11
9 3 4 5 7 8 9 10 11
To use this index you can use the following code:
use Algorithm::DBSCAN;
my $points_data_file =
'point_1 56 37
point_2 34 46
point_3 23 60
point_4 10 24
point_5 10 25
point_6 9 24
point_7 10 22
point_8 11 25
point_9 9 25
point_10 10 27
point_11 11 30
point_12 10 25
point_13 47 12
point_14 47 12
point_15 47 12
point_16 44 15
point_17 46 12
point_18 44 14
point_19 48 15
point_20 52 16
point_21 50 14';
my $dataset = Algorithm::DBSCAN::Dataset->new();
my @lines = split(/\n\s+/, $points_data_file);
foreach my $line (@lines) {
$dataset->AddPoint(new Algorithm::DBSCAN::Point(split(/\s+/, $line)));
}
my $dbscan = Algorithm::DBSCAN->new($dataset, 4 * 4, 2);
$dbscan->UseRegionIndex(the filename containg the previous region index);
$dbscan->FindClusters();
$dbscan->PrintClustersShort();
SUBROUTINES/METHODS
new
The constructor takes 3 parameters:
$dataset: The Algorithm::DBSCAN::Dataset dataset object
Create the Dataset object:
my $dataset = Algorithm::DBSCAN::Dataset->new();
Add points (the first parameter is a point_id the other are point coordinates)
$dataset->AddPoint(new Algorithm::DBSCAN::Point('point_1', 1, 2, 3, 4, 5);
$eps: The epsilon parameter used for region density computation
WARNING: This implementation uses the sqare distance between the points to avoid
a useless square root call. If you want to use the euclidian distance you need to
convert it to the right value yourself.
For example for the previous point with 5 dimensions
$eps = $euclidian_distance * $euclidian_distance * $euclidian_distance * $euclidian_distance * $euclidian_distance;
$min_points: the minimal number of points in a region with a radius of $eps. $eps
and $min_points are the 2 parameters used to compute the denisty of a region. If
the number of points in a region with radius $eps is lower than $min_points the
point is considered as an outlier point that can't be included in any cluster.
FindClusters
The main method that will run the DBSCAN algorithm on the Dataset.
ExpandCluster
This method will expand the cluster starting by the neighborhood of point $point
GetRegion
Find all points in the dataset that are in the neighborhood of $point
UseRegionIndex
For huge datasets a region index can be generated separately (and using multiple cores). The index is a list of regions for each point in the dataset.
PrintClusters
Will print the contents of the clusters
PrintClustersShort
Will print the contents of the clusters (abreviated version)
_one_more_point_visited
Simple method used to display progress
AUTHOR
Michal TOMA, <mtoma at cpan.org>
BUGS
Please report any bugs or feature requests on github: https://github.com/mtoma/Algorithm-DBSCAN
By e-mail to bug-algorithm-dbscan at rt.cpan.org
, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Algorithm-DBSCAN. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Algorithm::DBSCAN
You can also look for information at:
Github: Issues (report bugs here)
RT: CPAN's request tracker (report bugs here)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
LICENSE AND COPYRIGHT
Copyright 2016 Michal TOMA.
This program is free software; you can redistribute it and/or modify it under the terms of the the Artistic License (2.0). You may obtain a copy of the full license at:
http://www.perlfoundation.org/artistic_license_2_0
Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license.
If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license.
This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder.
This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed.
Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.