



















DBSCAN 简单来说就是一种基于密度的聚类算法。
数据输入支持weka数据格式,里面有一个例子数据,结果与weka比较过,是相同的。
网上有一个DBSCAN的C#的源码,但是错的。
现无偿贡献出来,没有任何协议,可以随意引用,当然了,能发封信给我或者发个明信片啥的更好了,嘿嘿。
From Weikpedia
DBScan requires two parameters: epsilon (eps) and minimum points (minPts). It starts with an arbitrary starting point that has not been visited. It then finds all the neighbour points within distance eps of the starting point.
If the number of neighbors is greater than or equal to minPts, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbours recursively.
If the number of neighbors is less than minPts, the point is marked as noise.
If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points in the dataset.
C = 0
for each unvisited point P in dataset D
N = getNeighbors (P, epsilon)
if (sizeof(N) < minPts)
mark P as NOISE
else
++C
mark P as visited
add P to cluster C
recurse (N)
Compare to other clustering algorithms, such as K-Means, DBSCAN does not scale very well in data volumen. If more data points are added to the clustering data set the slower the algorithms gets due to the nature of the getNeighbors(P,epsilon) function.
A nearest neighborhood detection takes place in the getNeighbors(P,epsilon) funktion. For a data point P all data points must be detected that are within the range of epsilon, based on the distance function used in the algorithm. The detection requires that a distance matrix is calculated for the whole data set. The generation of the distance matrix has a complexity of O((n2-n)/2) since only an upper matrix is needed. Within the distance matrix the nearest neighbors can be detected by selecting a tuple with minimums functions over the rows and columns. Research has been pushing the neighborhood detection into traditional databases to enhance the speed of the detection. Databases solve the neighborhood problem with indexes specifically designed for this type of application
Every data mining task has the problem of parameters. Every parameter influences the algorithm in sepcifc ways. For DBSCAN the parameters epsilon and MinPnts are needed. The parameters must be specified by the user of the algorithms since other data sets and other questions require differnt parameters. An initial value for epsilon can be determined by a k-distance graph. As a thumbrule k is determined with D (number of dimensions in the data set): k=D+1.
Although this parameter estimation gives a sufficient inital parameter set the resulting clustering can turn out to be not the expected partitioning. Therfore research has been performed on incrementally optimising the parameters agains a specific target value.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。