惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

T
Threatpost
V
Vulnerabilities – Threatpost
TaoSecurity Blog
TaoSecurity Blog
C
Cybersecurity and Infrastructure Security Agency CISA
P
Proofpoint News Feed
G
GRAHAM CLULEY
S
Securelist
P
Palo Alto Networks Blog
MongoDB | Blog
MongoDB | Blog
A
Arctic Wolf
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
WordPress大学
WordPress大学
Project Zero
Project Zero
T
Threat Research - Cisco Blogs
L
Lohrmann on Cybersecurity
C
Cyber Attacks, Cyber Crime and Cyber Security
F
Fortinet All Blogs
博客园 - 叶小钗
B
Blog RSS Feed
C
Cisco Blogs
Google DeepMind News
Google DeepMind News
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
Apple Machine Learning Research
Apple Machine Learning Research
G
Google Developers Blog
K
Kaspersky official blog
D
Docker
Latest news
Latest news
Cisco Talos Blog
Cisco Talos Blog
T
Tor Project blog
Cyberwarzone
Cyberwarzone
Security Latest
Security Latest
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
D
Darknet – Hacking Tools, Hacker News & Cyber Security
Spread Privacy
Spread Privacy
Microsoft Azure Blog
Microsoft Azure Blog
C
Check Point Blog
J
Java Code Geeks
Simon Willison's Weblog
Simon Willison's Weblog
T
Tenable Blog
Recent Announcements
Recent Announcements
T
Tailwind CSS Blog
H
Help Net Security
L
LINUX DO - 热门话题
T
The Exploit Database - CXSecurity.com
Jina AI
Jina AI
S
SegmentFault 最新的问题
MyScale Blog
MyScale Blog
NISL@THU
NISL@THU
美团技术团队
腾讯CDC

博客园 - Seraph

共享源码 统计多个word文件中的总字数 惭愧啊,居然两年没有写Blog了。翻看从前的东西恍惚隔世 DNN WebMail Module 开发 —— 第一篇 SharpWebMail 调试篇 开始研究DNN了。发现自己落伍了很多 在 ASP.NET 中执行 URL 重写(zz) XML转义 C#中的多線程 ASP 字符串函数 C# escape sequence - Seraph C#字符串的使用笔记 Request Validation - 防止脚本攻击 关于一个算法,看看大家有没有好办法。 写了一个进制转换的小函数,javascript的 发现好久没有写随笔了。 都是一些小东西(1)一个InputBox 谁知道怎么对一个asp.net的project加注册 天子呼来不上船,自称臣是酒中仙——我的嗜酒情节 两篇文章都是翻译了一半就翻不下去了,E文水平有待提高啊 Scott Watermasysk承诺的文档到现在还没有Release,sigh
聚类算法DBScan共享
Seraph · 2008-09-17 · via 博客园 - Seraph

DBSCAN 简单来说就是一种基于密度的聚类算法。
数据输入支持weka数据格式,里面有一个例子数据,结果与weka比较过,是相同的。
网上有一个DBSCAN的C#的源码,但是错的。
现无偿贡献出来,没有任何协议,可以随意引用,当然了,能发封信给我或者发个明信片啥的更好了,嘿嘿。

下载

From Weikpedia

Steps

DBScan requires two parameters: epsilon (eps) and minimum points (minPts). It starts with an arbitrary starting point that has not been visited. It then finds all the neighbour points within distance eps of the starting point.

If the number of neighbors is greater than or equal to minPts, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbours recursively.

If the number of neighbors is less than minPts, the point is marked as noise.

If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points in the dataset.

Pseudocode

  C = 0
  for each unvisited point P in dataset D
     N = getNeighbors (P, epsilon)
     if (sizeof(N) < minPts)
        mark P as NOISE
     else
        ++C
        mark P as visited
        add P to cluster C
        recurse (N)

Advantages

  1. DBScan does not require you to know the number of clusters in the data a priori. Compare this with k-means.
  2. DBScan does not have a bias towards a particular cluster shape or size. Compare this with k-means.
  3. DBScan is resistant to noise and provides a means of filtering for noise if desired.

Disadvantages

  1. DBSCAN can only result in a good clustering as good as its distance measure is in the function getNeighbors(P,epsilon). The most common distance metric used is the euclidean distance measure. Also any other arbitrary distance measure can be used to influence the behaviour of the resulting cluster.
  2. DBScan does not respond well to data sets with varying densities so called hirachical data sets.

Compare to other clustering algorithms, such as K-Means, DBSCAN does not scale very well in data volumen. If more data points are added to the clustering data set the slower the algorithms gets due to the nature of the getNeighbors(P,epsilon) function.

Nearest Neighborhood Detection

A nearest neighborhood detection takes place in the getNeighbors(P,epsilon) funktion. For a data point P all data points must be detected that are within the range of epsilon, based on the distance function used in the algorithm. The detection requires that a distance matrix is calculated for the whole data set. The generation of the distance matrix has a complexity of O((n2-n)/2) since only an upper matrix is needed. Within the distance matrix the nearest neighbors can be detected by selecting a tuple with minimums functions over the rows and columns. Research has been pushing the neighborhood detection into traditional databases to enhance the speed of the detection. Databases solve the neighborhood problem with indexes specifically designed for this type of application

Parameter Estimation

Every data mining task has the problem of parameters. Every parameter influences the algorithm in sepcifc ways. For DBSCAN the parameters epsilon and MinPnts are needed. The parameters must be specified by the user of the algorithms since other data sets and other questions require differnt parameters. An initial value for epsilon can be determined by a k-distance graph. As a thumbrule k is determined with D (number of dimensions in the data set): k=D+1.

Although this parameter estimation gives a sufficient inital parameter set the resulting clustering can turn out to be not the expected partitioning. Therfore research has been performed on incrementally optimising the parameters agains a specific target value.

References

  • "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". Proceedings of 2nd International Conference on KDD, AAAI Press. Retrieved on 2007-10-15.
  • "Experiments in Parallel Clustering with DBSCAN". Euro-Par 2001: Parallel Processing: 7th International Euro-Par Conference Manchester, UK August 28-31, 2001, Proceedings, Springer Berlin. Retrieved on 2004-02-19.