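Table of Contents
Cluster Analysis
1. What Is Cluster Analysis
2. The 11 Typical Steps of Cluster Analysis
3. The Two Most Popular Clustering Methods
4. Hierarchical Cluster Analysis Example
5. k-means Clustering Example
6. PAM Clustering Example
4. Hierarchical Cluster Analysis Example
1) Loading the data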
library(flexclust)
data(nutrient)
str(nutrient)
## 'data.frame': 27 obs. of 5 variables:
## $ iron : num 2.6 2.7 2 2.6 3.7 1.4 1.5 5.9 2.6 2.3 ...
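nutrient (nutrient measurements for 27 foods)
-energy: food energy (calories)
-protein: protein (grams)
-fat: fat (grams)
-calcium: calcium (milligrams)
-iron: iron (milligrams)
2) Scaling the data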
nutrient.scaled <- scale(nutrient)
head(nutrient.scaled)
## energy protein fat calcium iron
## BEEF BRAISED 1.3101024 0.2352002 1.2897287 -0.4480464 0.1495365
## HAMBURGER 0.3714397 0.4704005 0.3125618 -0.4480464 0.2179685
## BEEF ROAST 2.1005553 -0.9408009 2.2668955 -0.4736761 -0.2610553
## BEEF STEAK 1.6559256 0.0000000 1.6450621 -0.4480464 0.1495365
## BEEF CANNED -0.2708033 0.7056007 -0.3092717 -0.3455273 0.9022882
## CHICKEN BROILED -0.9130462 0.2352002 -0.9311051 -0.4608612 -0.6716471
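The variables are measured in different units, so we standardize them first. With its defaults, scale() centers each column to mean 0 and rescales it to standard deviation 1, which keeps large-valued variables such as energy from dominating the distances.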
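To make the standardization concrete, here is a minimal sketch that reproduces scale() by hand. The matrix x and its values are made up for illustration and are not taken from the nutrient data.

```r
# Toy matrix with two columns on very different scales (illustrative values)
x <- cbind(energy = c(100, 200, 300), iron = c(1.0, 2.0, 6.0))

# Manual standardization: subtract the column mean, then divide by the column sd
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# scale() with default arguments performs the same computation
auto <- scale(x)

all.equal(manual, auto, check.attributes = FALSE)  # compare values only
colMeans(auto)       # column means, numerically zero
apply(auto, 2, sd)   # column standard deviations, equal to 1
```

After this step every column contributes on a comparable scale when distances are computed.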
3) Computing distances
d <- dist(nutrient.scaled)
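The dist() function computes the pairwise distances between rows. Its default method is Euclidean distance.
To use a different method, pass the method argument, for example dist(x, method = "manhattan").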
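The difference between the two methods is easy to check on a tiny example. The sketch below uses two made-up points, named a and b purely for illustration.

```r
# Two points in the plane (illustrative values)
pts <- rbind(a = c(0, 0), b = c(3, 4))

d_euclid    <- dist(pts)                        # sqrt(3^2 + 4^2) = 5
d_manhattan <- dist(pts, method = "manhattan")  # |3| + |4| = 7

d_euclid
d_manhattan

# as.matrix() expands a dist object into a full symmetric matrix
as.matrix(d_euclid)["a", "b"]
```

Euclidean distance is the straight-line length, while Manhattan distance sums the absolute coordinate differences.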
4) Choosing a clustering algorithm
(fit.avergage <- hclust(d = d, method = 'average'))
##
## Call:
## hclust(d = d, method = "average")
##
## Cluster method : average
## Distance : euclidean
## Number of objects: 27
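Hierarchical clustering is done with hclust(). Its method argument selects the linkage algorithm, for example 'single', 'complete', 'average', 'centroid', or 'ward.D'.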
plot(fit.avergage, hang = -1, cex = .8, main = 'Average Linkage Clustering')
This is the dendrogram produced by average linkage on the Euclidean distances.
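To see how much the method argument of hclust() matters, one option is to fit several linkages to the same distances and compare the resulting groupings. This is only a sketch: it uses the built-in USArrests data as a stand-in so that it runs without extra packages, and the choice of k = 4 is arbitrary.

```r
# USArrests ships with base R; it is used here only as a stand-in dataset
d_demo <- dist(scale(USArrests))

# Fit the same distance matrix with several linkage methods
methods <- c("single", "complete", "average", "ward.D2")
fits <- lapply(methods, function(m) hclust(d_demo, method = m))
names(fits) <- methods

# Cluster sizes when each tree is cut into 4 groups (k = 4 is arbitrary)
sizes <- sapply(fits, function(f) as.vector(table(cutree(f, k = 4))))
sizes
```

Each column of sizes lists the cluster sizes for one linkage. Single linkage often produces one large cluster plus a few very small ones, whereas Ward's method tends to give more evenly sized groups.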
5) Deciding how many clusters to present
library(NbClust)
nc <- NbClust(data = nutrient.scaled, distance = 'euclidean', min.nc = 2, max.nc = 15, method = 'average')
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 4 proposed 5 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 2 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
barplot(
table(nc$Best.nc[1, ]),
xlab = 'Number of clusters',
ylab = 'Number of criteria',
main = 'Number of clusters chosen by 26 criteria')
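Solutions with 2, 3, 5, and 15 clusters each received four votes, the most of any candidate. Because they are tied, any of them is a defensible choice, and we pick one based on how interpretable the resulting clusters are.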
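The tie can also be read off programmatically. The sketch below copies the vote counts from the NbClust summary printed above into a named vector (votes is a name introduced here for illustration) and extracts the candidates that share the top count.

```r
# Vote counts copied from the NbClust summary (number of clusters -> number of indices)
votes <- c("2" = 4, "3" = 4, "4" = 2, "5" = 4, "9" = 1,
           "10" = 1, "13" = 2, "14" = 1, "15" = 4)

# Candidates that share the highest vote count
tied <- names(votes)[votes == max(votes)]
tied
# -> "2" "3" "5" "15"
```

With the real NbClust object, the same idea should apply to table(nc$Best.nc[1, ]), after dropping the 0 entry if one is present.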
6) Obtaining the final cluster solution
clusters <- cutree(tree = fit.avergage, k = 5)
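We settle on 5 clusters and use cutree() to cut the tree at that level. It returns a vector that assigns each observation a cluster label from 1 to 5.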
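Once cluster labels are available, a common next step is to check the cluster sizes and profile each cluster on the original variables. The following is a minimal sketch of that idea on the built-in USArrests data (a stand-in, not the nutrient data); hc, groups, and profile are names chosen here for illustration.

```r
# Stand-in data: USArrests ships with base R
hc <- hclust(dist(scale(USArrests)), method = "average")

# Cut the tree into 5 clusters; the result is one integer label per row
groups <- cutree(hc, k = 5)

# How many observations fall in each cluster
table(groups)

# Median of each original (unscaled) variable within each cluster
profile <- aggregate(USArrests, by = list(cluster = groups), FUN = median)
profile
```

Profiling on the unscaled variables keeps the summary in the original units, which usually makes the clusters easier to describe.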
7) Visualizing the results
plot(fit.avergage, hang = -1, cex = .8, main = 'Average linkage clustering\n5 Cluster solution')
( rectClust <- rect.hclust(fit.avergage, k = 5) )
## [[1]]
## SARDINES CANNED
## 25
##
## [[2]]
## CLAMS RAW CLAMS CANNED
## 17 18
##
## [[3]]
## BEEF HEART
## 8
##
## [[4]]
## BEEF BRAISED BEEF ROAST BEEF STEAK
## 1 3 4
## LAMB SHOULDER ROAST SMOKED HAM PORK ROAST
## 10 11 12
## PORK SIMMERED
## 13
##
## [[5]]
## HAMBURGER BEEF CANNED CHICKEN BROILED CHICKEN CANNED
## 2 5 6 7
## LAMB LEG ROAST BEEF TONGUE VEAL CUTLET BLUEFISH BAKED
## 9 14 15 16
## CRABMEAT CANNED HADDOCK FRIED MACKEREL BROILED MACKEREL CANNED
## 19 20 21 22
## PERCH FRIED SALMON CANNED TUNA CANNED SHRIMP CANNED
## 23 24 26 27