NAME

Algorithm::DBSCAN - (ALFA code) Perl implementation of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm

SYNOPSIS

This module can be used to find clusters of points in a multidimensional space. More information can be found on Wikipedia: DBSCAN

The simple usage:

use Algorithm::DBSCAN;

my $points_data_file =     
    'point_1 56.514307478581514 37.146118456702034
    point_2 34.02049221667614 46.024651786417536
    point_3 23.473087508078684 60.62328221968349
    point_4 10.418513808840482 24.59808378533684
    point_5 10.583414831970764 25.902459835735534
    point_6 9.756855426925464 24.062840099892146
    point_7 10.567067873860672 22.32511341184489
    point_8 11.070046359352189 25.91278382647844
    point_9 9.537780590838175 25.000630928726288
    point_10 10.507367338512058 27.637356924097915
    point_11 11.949089580614444 30.67843911922257
    point_12 10.373548645248105 25.699863108892945
    point_13 47.061169019689615 12.482585189174058
    point_14 47.00269836645959 12.04880276389404
    point_15 47.197663384856476 12.899232975457025
    point_16 44.3719178488551 15.41709269630616
    point_17 46.31921200316786 12.556849509965417
    point_18 44.128763621333135 14.657970021594974
    point_19 48.89953587475758 15.183892607591467
    point_20 52.15333345222132 16.354597634497154
    point_21 50.03978361242539 14.85901473647285';

my $dataset = Algorithm::DBSCAN::Dataset->new();
my @lines = split(/\n\s+/, $points_data_file);
foreach my $line (@lines) {
    $dataset->AddPoint(new Algorithm::DBSCAN::Point(split(/\s+/, $line)));
}

my $dbscan = Algorithm::DBSCAN->new($dataset, 4 * 4, 2);

$dbscan->FindClusters();
$dbscan->PrintClustersShort();

If you have huge datasets and want to use multiple CPUs in a optimal way you can build the region index with an external tool (will soon be available). En axample of code that uses a region index would be as follow.

Given the dataset:

point_1 56 37
point_2 34 46
point_3 23 60
point_4 10 24
point_5 10 25
point_6 9 24
point_7 10 22
point_8 11 25
point_9 9 25
point_10 10 27
point_11 11 30
point_12 10 25
point_13 47 12
point_14 47 12
point_15 47 12
point_16 44 15
point_17 46 12
point_18 44 14
point_19 48 15
point_20 52 16
point_21 50 14

The region index with $eps = 4 x 4 and $min_distance = 2 would look like this:

0 0
1 1
2 2
3 3 4 5 6 7 8 9 11
4 3 4 5 6 7 8 9 11
5 3 4 5 6 7 8 9 11
10 9 10
12 12 13 14 16 17 18 20
11 3 4 5 6 7 8 9 11
13 12 13 14 16 17 18 20
14 12 13 14 16 17 18 20
15 15 16 17
16 12 13 14 15 16 17 18
18 12 13 14 16 18 20
7 3 4 5 6 7 8 9 11
20 12 13 14 18 19 20
6 3 4 5 6 7 8 11
17 12 13 14 15 16 17
19 19 20
8 3 4 5 6 7 8 9 11
9 3 4 5 7 8 9 10 11

To use this index you can use the following code:

  use Algorithm::DBSCAN;
  
  my $points_data_file =     
      'point_1 56 37
      point_2 34 46
      point_3 23 60
      point_4 10 24
      point_5 10 25
      point_6 9 24
      point_7 10 22
      point_8 11 25
      point_9 9 25
      point_10 10 27
      point_11 11 30
      point_12 10 25
      point_13 47 12
      point_14 47 12
      point_15 47 12
      point_16 44 15
      point_17 46 12
      point_18 44 14
      point_19 48 15
      point_20 52 16
      point_21 50 14';

  my $dataset = Algorithm::DBSCAN::Dataset->new();
  my @lines = split(/\n\s+/, $points_data_file);
  foreach my $line (@lines) {
      $dataset->AddPoint(new Algorithm::DBSCAN::Point(split(/\s+/, $line)));
  }

  my $dbscan = Algorithm::DBSCAN->new($dataset, 4 * 4, 2);

  $dbscan->UseRegionIndex(the filename containg the previous region index);
  $dbscan->FindClusters();
  $dbscan->PrintClustersShort();

SUBROUTINES/METHODS

new

The constructor takes 3 parameters:

$dataset: The Algorithm::DBSCAN::Dataset dataset object
    
    Create the Dataset object:
        my $dataset = Algorithm::DBSCAN::Dataset->new();
    
    Add points (the first parameter is a point_id the other are point coordinates)
        $dataset->AddPoint(new Algorithm::DBSCAN::Point('point_1', 1, 2, 3, 4, 5);
        
$eps: The epsilon parameter used for region density computation
    WARNING: This implementation uses the sqare distance between the points to avoid 
    a useless square root call. If you want to use the euclidian distance you need to 
    convert it to the right value yourself.
    
    For example for the previous point with 5 dimensions 
    $eps = $euclidian_distance * $euclidian_distance * $euclidian_distance * $euclidian_distance * $euclidian_distance; 
    
$min_points: the minimal number of points in a region with a radius of $eps. $eps 
and $min_points are the 2 parameters used to compute the denisty of a region. If 
the number of points in a region with radius $eps is lower than $min_points the 
point is considered as an outlier point that can't be included in any cluster.

FindClusters

The main method that will run the DBSCAN algorithm on the Dataset.

ExpandCluster

This method will expand the cluster starting by the neighborhood of point $point

GetRegion

Find all points in the dataset that are in the neighborhood of $point

UseRegionIndex

For huge datasets a region index can be generated separately (and using multiple cores). The index is a list of regions for each point in the dataset.

PrintClusters

Will print the contents of the clusters

PrintClustersShort

Will print the contents of the clusters (abreviated version)

_one_more_point_visited

Simple method used to display progress

AUTHOR

Michal TOMA, <mtoma at cpan.org>

BUGS

Please report any bugs or feature requests on github: https://github.com/mtoma/Algorithm-DBSCAN

By e-mail to bug-algorithm-dbscan at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Algorithm-DBSCAN. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Algorithm::DBSCAN

You can also look for information at:

LICENSE AND COPYRIGHT

Copyright 2016 Michal TOMA.

This program is free software; you can redistribute it and/or modify it under the terms of the the Artistic License (2.0). You may obtain a copy of the full license at:

http://www.perlfoundation.org/artistic_license_2_0

Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.