Density peaks clustering algorithm with circle-division sampling and reverse nearest neighbor optimization

趙嘉; 何超凡; 肖人彬; 曹浩; 樊棠懷

doi:10.13374/j.issn2095-9389.2025.03.30.001

Density peaks clustering algorithm with circle-division sampling and reverse nearest neighbor optimization

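Abstract

Abstract (translated from the Chinese): The density peaks clustering (DPC) algorithm is a simple and efficient clustering algorithm that has attracted wide attention because it finds the clusters in a dataset intuitively and quickly. However, DPC must compute the Euclidean distance between every pair of samples, so its time complexity is high; its definition of local density ignores density differences between clusters, so cluster centers are easily misselected; and its chain-like assignment strategy is prone to propagating errors. This paper therefore proposes a density peaks clustering algorithm with circle-division sampling and reverse nearest neighbor optimization. The algorithm uses circle-division (circular grid) sampling to obtain representatives, which reduces the number of samples involved in the computation and lowers the time overhead, and it introduces an approximate K-nearest neighbor strategy to strengthen the link between the representatives and the original samples, which limits the loss of clustering accuracy caused by sampling. It uses reverse nearest neighbors to refine the definition of local density, adjusting each sample's local density according to its surroundings so that density peaks are located accurately. It computes similarity from shared reverse nearest neighbors and assigns representatives according to the resulting similarity matrix, which avoids the error propagation caused by the sample assignment strategy. Grouped experiments were run on synthetic datasets with complex shapes, real-world datasets, and larger-scale datasets. The results show that the proposed algorithm has a clear clustering advantage on all three groups, with substantial gains in accuracy and efficiency over DPC and other DPC-based improved algorithms.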
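
As a point of reference, the two quantities that baseline DPC computes for every sample can be sketched as follows. This is a minimal illustration of the standard cutoff-kernel formulation, not the method proposed in this paper; the function name `dpc_rho_delta` and the cutoff parameter `d_c` are assumptions chosen for the example. The full pairwise distance matrix built in the first step is the source of the high time complexity noted above.

```python
import math


def dpc_rho_delta(points, d_c):
    """Baseline DPC quantities: local density rho and distance delta.

    points: list of coordinate tuples; d_c: cutoff distance (illustrative name).
    """
    n = len(points)
    # Full pairwise Euclidean distance matrix: O(n^2) time and memory.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho_i: number of other samples closer than the cutoff distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to all samples with strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A sample with no denser sample takes its largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta
```

Samples for which both `rho` and `delta` are large are the candidates that DPC treats as cluster centers.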

Abstract: The density peaks clustering (DPC) algorithm has garnered significant attention in the research community because of its simple, intuitive, and efficient framework for identifying cluster centers in datasets. The core strength of the algorithm lies in its ability to discover clusters of arbitrary shapes by making a fundamental assumption that cluster centers are characterized by a higher local density than their neighbors and are relatively distant from points with higher densities. Despite its effectiveness, the canonical DPC algorithm has several critical limitations that restrict its performance and applicability, particularly for complex and large-scale data. First, its computational complexity is prohibitively high, scaling quadratically with the number of samples, because it requires the calculation of a full pairwise Euclidean distance matrix. Second, its definition of local density fails to account for inter-cluster density variations, often leading to erroneous selection of cluster centers in datasets with heterogeneous density distributions. Third, its sequential, chain-like assignment strategy for non-center points is susceptible to a “domino effect,” where a single incorrect assignment can trigger a cascade of subsequent errors, severely compromising the final clustering accuracy. To overcome these deficiencies, this study proposes a novel and robust variant of the DPC algorithm, called the density peaks clustering algorithm with circle-division sampling and reverse nearest neighbor optimization (CDPC-RNN). The proposed algorithm systematically enhanced each stage of the DPC process. First, to address the computational bottleneck, we introduced a circle-division sampling method. This strategy effectively generated a set of representative points that preserved the underlying data distribution while substantially reducing the number of samples required for distance calculations, thereby significantly decreasing the time overhead of the algorithm.
To mitigate any potential loss of clustering precision resulting from this sampling, an approximate K-nearest neighbor strategy was employed to strengthen the link between the representative points and the original samples. Second, we refined the cluster center identification process by optimizing the definition of local density. By leveraging the concept of reverse nearest neighbors, our approach reevaluated the local density of a point based on its surrounding environment. This adaptive density calculation allowed the algorithm to accurately distinguish true density peaks from spurious ones, particularly in complex datasets where clusters exhibited significant variations in density and scale. Finally, we replaced the fragile chain-like assignment policy with a more robust mechanism. By calculating the similarity between representative points based on their shared reverse nearest neighbors, we constructed a global similarity matrix. The final assignment of points to clusters was guided by this matrix, which circumvents the cascading errors inherent in the sequential approach of the original DPC algorithm. The performance of the proposed algorithm was evaluated through a series of comparative experiments conducted on diverse datasets, including synthetic datasets with complex morphological structures, real-world benchmark datasets, and large-scale datasets. The experimental results demonstrated that our algorithm has a significant advantage over both the original DPC and several other state-of-the-art DPC-based improvements. The proposed CDPC-RNN method achieved substantial enhancements in both clustering accuracy and computational efficiency, establishing it as a powerful and reliable solution for a wide range of clustering tasks.
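
The reverse nearest neighbor idea used above can be illustrated with a small sketch. The abstract does not give the exact density or similarity formulas, so the code below only shows the standard definition of reverse k-nearest neighbors, plus a shared-neighbor count as a stand-in similarity. The function names and the plain intersection count are assumptions for illustration, not the CDPC-RNN definitions.

```python
import math


def knn_lists(points, k):
    """Indices of the k nearest neighbors of each point, excluding itself."""
    n = len(points)
    result = []
    for i in range(n):
        # Sort the other points by distance; ties are broken by index.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (math.dist(points[i], points[j]), j))
        result.append(order[:k])
    return result


def reverse_knn(knn):
    """RNN(i): the set of points that list i among their k nearest neighbors."""
    rnn = [set() for _ in knn]
    for j, neighbors in enumerate(knn):
        for i in neighbors:
            rnn[i].add(j)
    return rnn


def shared_rnn_similarity(rnn):
    """Hypothetical similarity: number of reverse neighbors two points share."""
    n = len(rnn)
    return [[len(rnn[a] & rnn[b]) if a != b else 0 for b in range(n)]
            for a in range(n)]
```

For the points `[(0, 0), (1, 0), (2, 0), (10, 0)]` with `k = 2`, the isolated point at `(10, 0)` has an empty reverse neighbor set, while the three nearby points each appear in at least two neighbor lists. This is the sense in which reverse neighbor counts reflect a sample's surroundings.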
