library(gclus)

data(wine)

dataset <- wine[,-1] #去除分类标签

dataset <- scale(dataset)

1.mclust包

mclust包是聚类分析非常强大的一个包，也是上课时老师给我们介绍的一个包，每次导入时有一种科技感 :) 帮助文档非常详尽，可以进行聚类、分类、密度分析

Mclust包方法有点“暴力”，聚类数目自定义，比如我选取的从1到20，然后一共14种模型，每一种模型都计算聚类数目从1到20的BIC值，最终确定较佳聚类数目，这种方法的思想很直接了当，但是弊端也就显然易见了——时间复杂度太高，效率低。

library(mclust)

m_clust <- Mclust(as.matrix(dataset), G=1:20) #聚类数目从1一直试到20

summary(m_clust)

Gaussian finite mixture model fitted by EM algorithm

Mclust EVE (ellipsoidal, equal volume and orientation) model with 3 components:

log.likelihood n df BIC ICL

-3032.45 178 156 -6873.257 -6873.549

Clustering table:

1 2 3

63 51 64

plot(m_clust, "BIC")

mclust包还可以用于分类、密度估计等，这个包值得好好把玩。

1.维基上的贝叶斯信息准则定义

2.Mclust包中的BIC定义[3]

2.Nbclust包

Nbclust包是我在《R语言实战》上看到的一个包，思想和mclust包比较相近，也是定义了几十个评估指标，然后聚类数目从2遍历到15（自己设定），然后通过这些指标看分别在聚类数为多少时达到最优，最后选择指标支持数最多的聚类数目就是较佳聚类数目。

library(NbClust)

set.seed(1234) #因为method选择的是kmeans，所以如果不设定种子，每次跑得结果可能不同

nb_clust <- NbClust(dataset, distance = "euclidean",

min.nc=2, max.nc=15, method = "kmeans",

index = "alllong", alphaBeale = 0.1)

*** : The Hubert index is a graphical method of determining the number of clusters.

In the plot of Hubert index, we seek a significant knee that corresponds to a

significant increase of the value of the measure i.e the significant peak in Hubert

index second differences plot.

*** : The D index is a graphical method of determining the number of clusters.

In the plot of D index, we seek a significant knee (the significant peak in Dindex

second differences plot) that corresponds to a significant increase of the value of

the measure.

*******************************************************************

* Among all indices:

* 5 proposed 2 as the best number of clusters

* 16 proposed 3 as the best number of clusters

* 1 proposed 10 as the best number of clusters

* 1 proposed 12 as the best number of clusters

* 1 proposed 14 as the best number of clusters

* 3 proposed 15 as the best number of clusters

***** Conclusion *****

* According to the majority rule, the best number of clusters is 3

*******************************************************************

barplot(table(nb_clust\$Best.nc[1,]),xlab = "聚类数",ylab = "支持指标数")

3. 组内平方误差和——拐点图

wssplot <- function(data, nc=15, seed=1234){

wss <- (nrow(data)-1)*sum(apply(data,2,var))

for (i in 2:nc){

set.seed(seed)

wss[i] <- sum(kmeans(data, centers=i)\$withinss)

}

plot(1:nc, wss, type="b", xlab="Number of Clusters",

ylab="Within groups sum of squares")}

wssplot(dataset)

library(factoextra)

library(ggplot2)

set.seed(1234)

fviz_nbclust(dataset, kmeans, method = "wss") +

geom_vline(xintercept = 3, linetype = 2)

km.res <- kmeans(dataset,3)

fviz_cluster(km.res, data = dataset)

4. PAM(Partitioning Around Medoids) 围绕中心点的分割算法

k-means算法取得是均值，那么对于异常点其实对其的影响非常大，很可能这种孤立的点就聚为一类，一个改进的方法就是PAM算法，也叫k-medoids clustering

library(fpc)

pamk.best <- pamk(dataset)

pamk.best\$nc

3

pamk函数不需要提供聚类数目，也会直接自动计算出较佳聚类数，这里也得到为3

library(cluster)

clusplot(pam(dataset, pamk.best\$nc))

5.Calinsky criterion

library(vegan)

ca_clust <- cascadeKM(dataset, 1, 10, iter = 1000)

ca_clust\$results

calinski.best <- as.numeric(which.max(ca_clust\$results[2,]))

calinski.best

3

plot(fit, sortg = TRUE, grpmts.plot = TRUE)

calinski<-as.data.frame(ca_clust\$results[2,])

calinski\$cluster <- c(1:10)

library(ggplot2)

ggplot(calinski,aes(x = calinski[,2], y = calinski[,1]))+geom_line()

Warning message:

"Removed 1 rows containing missing values (geom_path)."

6.Affinity propagation (AP) clustering

AP算法的基本思想是将全部样本看作网络的节点，然后通过网络中各条边的消息传递计算出各样本的聚类中心。聚类过程中，共有两种消息在各节点间传递，分别是吸引度( responsibility)和归属度(availability) 。AP算法通过迭代过程不断更新每一个点的吸引度和归属度值，直到产生m个高质量的Exemplar（类似于质心），同时将其余的数据点分配到相应的聚类中[7]

library(apcluster)

ap_clust <- apcluster(negDistMat(r=2), dataset)

length([email protected])

15

heatmap(ap_clust)

7. 轮廓系数Average silhouette method

a(i)是测量组内的相似度,b(i)是测量组间的相似度，s(i)范围从-1到1，值越大说明组内吻合越高，组间距离越远——也就是说，轮廓系数值越大，聚类效果越好[9]

require(cluster)

library(factoextra)

fviz_nbclust(dataset, kmeans, method = "silhouette")

8. Gap Statistic

library(cluster)

set.seed(123)

gap_clust <- clusGap(dataset, kmeans, 10, B = 500, verbose = interactive())

gap_clust

Clustering Gap statistic ["clusGap"] from call:

clusGap(x = dataset, FUNcluster = kmeans, K.max = 10, B = 500, verbose = interactive())

B=500 simulated reference sets, k = 1..10; spaceH0="scaledPCA"

--> Number of clusters (method 'firstSEmax', SE.factor=1): 3

logW E.logW gap SE.sim

[1,] 5.377557 5.863690 0.4861333 0.01273873 [2,] 5.203502 5.758276 0.5547745 0.01420766 [3,] 5.066921 5.697322 0.6304006 0.01278909 [4,] 5.023936 5.651618 0.6276814 0.01243239 [5,] 4.993720 5.615174 0.6214536 0.01251765 [6,] 4.962933 5.584564 0.6216311 0.01165595 [7,] 4.943241 5.556310 0.6130690 0.01181831 [8,] 4.915582 5.531834 0.6162518 0.01139207 [9,] 4.881449 5.508514 0.6270646 0.01169532 [10,] 4.855837 5.487005 0.6311683 0.01198264

library(factoextra)

fviz_gap_stat(gap_clust)

9.层次聚类

h_dist <- dist(as.matrix(dataset))

h_clust<-hclust(h_dist)

plot(h_clust, hang = -1, labels = FALSE)

rect.hclust(h_clust,3)

10.clustergram

clustergram(dataset, k.range = 2:8, line.width = 0.004)

wine数据集我们知道其实是分为3类的，以上10种判定方法中：

mclust效果很差，14种模型只有6种有结果

bclust报错

SSE可以运行

fpc包中的pamk函数聚成2类，明显不行

Calinsky criterion聚成2类

Affinity propagation (AP) clustering 聚成28类，相对靠谱

gap-Statistic跑不出结果

[1]R语言实战第二版 [2]Partitioning cluster analysis: Quick start guide - Unsupervised Machine Learning [3]BIC：http://www.stat.washington.edu/raftery/Research/PDF/fraley1998.pdf [4]Cluster analysis in R: determine the optimal number of clusters [5]Calinski-Harabasz Criterion：Calinski-Harabasz criterion clustering evaluation object [6]Determining the optimal number of clusters: 3 must known methods - Unsupervised Machine Learning [7] affinity-propagation：聚类算法Affinity Propagation(AP) [8]轮廓系数https://en.wikipedia.org/wiki/Silhouette(clustering)) [9]gap statistic-Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic[J]. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2001, 63(2): 411-423. [10]ClustergramsClustergram: visualization and diagnostics for cluster analysis (R code)

QQ群：418451831

tags: #160,聚类,较佳,BIC,dataset,数目,clust,lt,clusters,number,library,kmeans

1.凡CodeSecTeam转载的文章,均出自其它媒体或其他官网介绍,目的在于传递更多的信息,并不代表本站赞同其观点和其真实性负责；
2.转载的文章仅代表原创作者观点,与本站无关。其原创性以及文中陈述文字和内容未经本站证实,本站对该文以及其中全部或者部分内容、文字的真实性、完整性、及时性，不作出任何保证或承若；
3.如本站转载稿涉及版权等问题,请作者及时联系本站,我们会及时处理。