Hierarchical Cluster Analysis

Cluster Analysis

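1. What is cluster analysis?

2. The 11 typical steps of cluster analysis

3. The two most popular clustering methods

4. Hierarchical cluster analysis example

5. k-means clustering example

6. PAM clustering example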

4. Hierarchical cluster analysis example

1) Loading the data

library(flexclust)

data(nutrient)

str(nutrient)

## 'data.frame': 27 obs. of 5 variables:

## $ energy : int 340 245 420 375 180 115 170 160 265 300 ...

## $ protein: int 20 21 15 19 22 20 25 26 20 18 ...

## $ fat : int 28 17 39 32 10 3 7 5 20 25 ...

## $ calcium: int 9 9 7 9 17 8 12 14 9 9 ...

## $ iron : num 2.6 2.7 2 2.6 3.7 1.4 1.5 5.9 2.6 2.3 ...

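nutrient (nutrient measurements for 27 kinds of meat, fish, and fowl)

- energy: food energy (calories)

- protein: protein (grams)

- fat: fat (grams)

- calcium: calcium (milligrams)

- iron: iron (milligrams)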

2) Scaling the data

nutrient.scaled <- scale(nutrient)

head(nutrient.scaled)

##                     energy    protein        fat    calcium       iron
## BEEF BRAISED     1.3101024  0.2352002  1.2897287 -0.4480464  0.1495365
## HAMBURGER        0.3714397  0.4704005  0.3125618 -0.4480464  0.2179685
## BEEF ROAST       2.1005553 -0.9408009  2.2668955 -0.4736761 -0.2610553
## BEEF STEAK       1.6559256  0.0000000  1.6450621 -0.4480464  0.1495365
## BEEF CANNED     -0.2708033  0.7056007 -0.3092717 -0.3455273  0.9022882
## CHICKEN BROILED -0.9130462  0.2352002 -0.9311051 -0.4608612 -0.6716471

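The variables are measured in different units and span very different ranges, so we standardize them first. By default, scale() centers each column to mean 0 and rescales it to standard deviation 1, so that no single variable dominates the distance calculation.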
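As a minimal sketch of what scale() does with its default arguments, the snippet below standardizes a small matrix by hand and compares the result with scale(). The matrix `toy` is an illustrative assumption: it holds only the first five energy and iron values shown in the str() output, not the full data set.

```r
# First five energy and iron values from the str() output (illustration only).
toy <- cbind(energy = c(340, 245, 420, 375, 180),
             iron   = c(2.6, 2.7, 2.0, 2.6, 3.7))

# Manual standardization: subtract the column mean, then divide by the column SD.
centered <- sweep(toy, 2, colMeans(toy))                 # default FUN is "-"
manual   <- sweep(centered, 2, apply(toy, 2, sd), "/")

# scale() with its defaults (center = TRUE, scale = TRUE) does the same thing.
scaled <- scale(toy)

# After scaling, every column has mean 0 and standard deviation 1.
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Because both columns end up on the same scale, a difference of one unit in energy and one unit in iron contribute equally to a distance.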

3) Computing distances

d <- dist(nutrient.scaled)

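The dist() function computes the pairwise distances between the rows. Its default method is Euclidean distance.

To use a different metric, pass the method argument, for example dist(x, method = "manhattan").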
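To make the difference between the two metrics concrete, here is a small sketch that computes both by hand and checks them against dist(). The matrix `x` and its values are made up for illustration.

```r
# Two hypothetical observations with three variables each.
x <- rbind(a = c(1, 2, 3),
           b = c(4, 6, 3))

# Euclidean distance: square root of the summed squared differences.
euclid_manual <- sqrt(sum((x["a", ] - x["b", ])^2))    # sqrt(9 + 16 + 0)

# Manhattan distance: sum of the absolute differences.
manhattan_manual <- sum(abs(x["a", ] - x["b", ]))      # 3 + 4 + 0

# dist() returns the same values.
euclid_dist    <- as.numeric(dist(x))                        # default: euclidean
manhattan_dist <- as.numeric(dist(x, method = "manhattan"))

c(euclid_manual, euclid_dist)        # 5 5
c(manhattan_manual, manhattan_dist)  # 7 7
```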

4) Choosing a clustering algorithm

(fit.avergage <- hclust(d = d, method = 'average'))

## Call:

## hclust(d = d, method = "average")

## Cluster method : average

## Distance : euclidean

## Number of objects: 27

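Hierarchical clustering is performed with the hclust() function. Its method argument selects the linkage algorithm, for example 'single', 'complete', 'average', 'centroid', or 'ward.D'.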

plot(fit.avergage, hang = -1, cex = .8, main = 'Average Linkage Clustering')

This is the dendrogram produced by average linkage on the Euclidean distances.
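As a rough illustration of what average linkage means, the sketch below clusters three made-up points on a line and inspects the merge heights. With average linkage, the distance between two clusters is the mean of all pairwise distances between their members. The objects `pts`, `d_toy`, and `fit_toy` are hypothetical names used only here.

```r
# Three hypothetical points on a line.
pts <- matrix(c(0, 1, 5), ncol = 1,
              dimnames = list(c("p1", "p2", "p3"), NULL))
d_toy <- dist(pts)            # p1-p2 = 1, p1-p3 = 5, p2-p3 = 4

fit_toy <- hclust(d_toy, method = "average")

# First merge: p1 and p2 at height 1.
# Second merge: {p1, p2} with p3 at the mean of 5 and 4, i.e. 4.5.
fit_toy$height                                   # 1.0 4.5

# Other linkages apply a different rule to the second merge.
h_single   <- hclust(d_toy, method = "single")$height    # minimum of 5 and 4
h_complete <- hclust(d_toy, method = "complete")$height  # maximum of 5 and 4
```

The same merge heights are what the vertical axis of a dendrogram shows, so changing the linkage changes where branches join.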

5) Determining the number of clusters

library(NbClust)

nc <- NbClust(data = nutrient.scaled, distance = 'euclidean', min.nc = 2, max.nc = 15, method = 'average')

## *** : The Hubert index is a graphical method of determining the number of clusters.

## In the plot of Hubert index, we seek a significant knee that corresponds to a

## significant increase of the value of the measure i.e the significant peak in Hubert

## index second differences plot.

## *** : The D index is a graphical method of determining the number of clusters.

## In the plot of D index, we seek a significant knee (the significant peak in Dindex

## second differences plot) that corresponds to a significant increase of the value of

## the measure.

## *******************************************************************

## * Among all indices:

## * 4 proposed 2 as the best number of clusters

## * 4 proposed 3 as the best number of clusters

## * 2 proposed 4 as the best number of clusters

## * 4 proposed 5 as the best number of clusters

## * 1 proposed 9 as the best number of clusters

## * 1 proposed 10 as the best number of clusters

## * 2 proposed 13 as the best number of clusters

## * 1 proposed 14 as the best number of clusters

## * 4 proposed 15 as the best number of clusters

## ***** Conclusion *****

## * According to the majority rule, the best number of clusters is 2

## *******************************************************************

barplot(table(nc$Best.nc[1, ]),
        xlab = 'Number of clusters',
        ylab = 'Number of criteria',
        main = 'Number of clusters chosen by 26 criteria')

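The tally is a four-way tie: 2, 3, 5, and 15 clusters were each proposed by four criteria. Any of these is a reasonable candidate, so choose among them.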
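The tie can be read off the summary programmatically. The sketch below is a small stand-alone illustration: the vector `votes` is typed in by hand from the NbClust summary printed above rather than taken from the `nc` object, so it runs without the NbClust package.

```r
# Vote counts copied from the NbClust summary
# (names = proposed number of clusters, values = number of criteria).
votes <- c("2" = 4, "3" = 4, "4" = 2, "5" = 4, "9" = 1,
           "10" = 1, "13" = 2, "14" = 1, "15" = 4)

# Candidates that share the highest vote count.
top <- names(votes)[votes == max(votes)]
top   # "2" "3" "5" "15"
```

Note that the printed conclusion names 2 as the best number even though four values tie, so it is worth looking at the full tally rather than the conclusion line alone.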

6) Obtaining the final clustering solution

clusters <- cutree(tree = fit.avergage, k = 5)

We settle on 5 clusters. The cutree() function cuts the tree at that number and returns the cluster membership of each observation.
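One common way to inspect a cutree() result is to count the members of each cluster and summarize the variables per cluster. The following is a minimal sketch on made-up data, so that it runs without the flexclust package. The names `toy`, `fit_toy`, and `groups` are hypothetical.

```r
# Hypothetical data: two well-separated groups of three points each.
toy <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
colnames(toy) <- c("x", "y")

fit_toy <- hclust(dist(toy), method = "average")

# cutree() returns one integer cluster label per observation.
groups <- cutree(fit_toy, k = 2)

# Cluster sizes.
sizes <- table(groups)
sizes

# Per-cluster medians of the original variables.
medians <- aggregate(as.data.frame(toy), by = list(cluster = groups), FUN = median)
medians
```

Applied to this post's objects, the analogous calls would presumably be table(clusters) and aggregate(nutrient, by = list(cluster = clusters), median).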

7) Visualizing the results

plot(fit.avergage, hang = -1, cex = .8, main = 'Average linkage clustering\n5 Cluster solution')

( rectClust <- rect.hclust(fit.avergage, k = 5) )

## [[1]]
## SARDINES CANNED
##              25

## [[2]]
##    CLAMS RAW CLAMS CANNED
##           17           18

## [[3]]
## BEEF HEART
##          8

## [[4]]
##        BEEF BRAISED          BEEF ROAST          BEEF STEAK
##                   1                   3                   4
## LAMB SHOULDER ROAST          SMOKED HAM          PORK ROAST
##                  10                  11                  12
##       PORK SIMMERED
##                  13

## [[5]]
##        HAMBURGER      BEEF CANNED  CHICKEN BROILED   CHICKEN CANNED
##                2                5                6                7
##   LAMB LEG ROAST      BEEF TONGUE      VEAL CUTLET   BLUEFISH BAKED
##                9               14               15               16
##  CRABMEAT CANNED    HADDOCK FRIED MACKEREL BROILED  MACKEREL CANNED
##               19               20               21               22
##      PERCH FRIED    SALMON CANNED      TUNA CANNED    SHRIMP CANNED
##               23               24               26               27
