k-means and PAM Clustering Examples

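Cluster Analysis

1. What is cluster analysis?

2. The 11 typical steps of cluster analysis

3. The two most popular clustering methods

4. Hierarchical cluster analysis example

5. k-means clustering example

6. PAM clustering example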

5. k-means clustering

1) Loading the data

library(rattle)

data(wine)

wine$Type <- factor(wine$Type, levels=c(1,2,3), labels = c('A','B','C'))

str(wine)

## 'data.frame': 178 obs. of 14 variables:

## $ Type : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...

## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...

## $ Malic : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...

## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...

## $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...

## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...

## $ Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...

## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...

## $ Nonflavanoids : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...

## $ Proanthocyanins: num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...

## $ Color : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...

## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...

## $ Dilution : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...

## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

2) Scaling the data

wine.scaled <- scale(wine[-1])

head(wine.scaled)

## Alcohol Malic Ash Alcalinity Magnesium Phenols

## [1,] 1.5143408 -0.56066822 0.2313998 -1.1663032 1.90852151 0.8067217

## [2,] 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481

## [3,] 0.1963252 0.02117152 1.1062139 -0.2679823 0.08810981 0.8067217

## [4,] 1.6867914 -0.34583508 0.4865539 -0.8069748 0.92829983 2.4844372

## [5,] 0.2948684 0.22705328 1.8352256 0.4506745 1.27837900 0.8067217

## [6,] 1.4773871 -0.51591132 0.3043010 -1.2860793 0.85828399 1.5576991

## Flavanoids Nonflavanoids Proanthocyanins Color Hue

## [1,] 1.0319081 -0.6577078 1.2214385 0.2510088 0.3611585

## [2,] 0.7315653 -0.8184106 -0.5431887 -0.2924962 0.4049085

## [3,] 1.2121137 -0.4970050 2.1299594 0.2682629 0.3174085

## [4,] 1.4623994 -0.9791134 1.0292513 1.1827317 -0.4263410

## [5,] 0.6614853 0.2261576 0.4002753 -0.3183774 0.3611585

## [6,] 1.3622851 -0.1755994 0.6623487 0.7298108 0.4049085

## Dilution Proline

## [1,] 1.8427215 1.01015939

## [2,] 1.1103172 0.96252635

## [3,] 0.7863692 1.39122370

## [4,] 1.1807407 2.32800680

## [5,] 0.4483365 -0.03776747

## [6,] 0.3356589 2.23274072

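The variables are measured in different units, so they are standardized before clustering. Otherwise variables with large values, such as Proline, would dominate the distance calculations.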
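As a minimal sketch of what scale() does, the snippet below standardizes a data set and reproduces the result by hand. It uses the built-in iris data as a stand-in (an assumption, so that it runs without the rattle package); the names x, centered and manual are illustrative only.

```r
# Stand-in data (assumption): the numeric columns of the built-in iris set.
x <- iris[-5]

# scale() centers each column on its mean and divides by its standard deviation.
x.scaled <- scale(x)

# The same result computed by hand with sweep().
centered <- sweep(as.matrix(x), 2, colMeans(x), "-")
manual   <- sweep(centered, 2, apply(x, 2, sd), "/")

# After scaling, every column has mean 0 and standard deviation 1.
round(colMeans(x.scaled), 10)
apply(x.scaled, 2, sd)
```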

3) Determining the number of clusters

set.seed(1234)

library(NbClust)

nc <- NbClust(data = wine.scaled, min.nc = 2, max.nc = 15, method = 'kmeans')

## *** : The Hubert index is a graphical method of determining the number of clusters.

## In the plot of Hubert index, we seek a significant knee that corresponds to a

## significant increase of the value of the measure i.e the significant peak in Hubert

## index second differences plot.

## *** : The D index is a graphical method of determining the number of clusters.

## In the plot of D index, we seek a significant knee (the significant peak in Dindex

## second differences plot) that corresponds to a significant increase of the value of

## the measure.

## *******************************************************************

## * Among all indices:

## * 4 proposed 2 as the best number of clusters

## * 15 proposed 3 as the best number of clusters

## * 1 proposed 10 as the best number of clusters

## * 1 proposed 12 as the best number of clusters

## * 1 proposed 14 as the best number of clusters

## * 1 proposed 15 as the best number of clusters

## ***** Conclusion *****

## * According to the majority rule, the best number of clusters is 3

## *******************************************************************

barplot(

table(nc$Best.nc[1, ]),

xlab = 'Number of clusters',

ylab = 'Number of criteria',

main = 'Number of clusters chosen by 26 criteria')

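With the candidate number of clusters set to 2 through 15, the NbClust function reports that most of the indices favor 3, so 3 clusters is a reasonable choice.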
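A simpler, complementary check is the "elbow" plot of the total within-cluster sum of squares against the number of clusters. This is a different technique from the NbClust majority vote, shown here only as a rough sketch on the built-in iris data (a stand-in, not the wine data); max.k and wss are illustrative names.

```r
# Stand-in data (assumption): standardized numeric columns of iris.
x.scaled <- scale(iris[-5])

set.seed(1234)
max.k <- 10
wss <- numeric(max.k)

# With a single cluster, the within-cluster SS equals the total SS.
wss[1] <- sum(scale(x.scaled, scale = FALSE)^2)

# For k = 2..max.k, record the total within-cluster SS of the k-means fit.
for (k in 2:max.k) {
  wss[k] <- kmeans(x.scaled, centers = k, nstart = 25)$tot.withinss
}

# Look for the bend where adding clusters stops reducing the SS sharply.
plot(1:max.k, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

The bend is judged by eye, so this plot is best treated as a sanity check alongside the index-based result rather than a replacement for it.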

4) Choosing the clustering algorithm and obtaining the final clustering

set.seed(1234)

( fit.km <- kmeans(x = wine.scaled, centers = 3, nstart = 25) )

## K-means clustering with 3 clusters of sizes 62, 65, 51

## Cluster means:

## Alcohol Malic Ash Alcalinity Magnesium Phenols

## 1 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724

## 2 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891

## 3 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548

## Flavanoids Nonflavanoids Proanthocyanins Color Hue

## 1 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504

## 2 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046

## 3 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122

## Dilution Proline

## 1 0.7770551 1.1220202

## 2 0.2700025 -0.7517257

## 3 -1.2887761 -0.4059428

## Clustering vector:

## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2

## [71] 2 2 2 1 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2

## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3

## [141] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

## [176] 3 3 3

## Within cluster sum of squares by cluster:

## [1] 385.6983 558.6971 326.3537

## (between_SS / total_SS = 44.8 %)

## Available components:

## [1] "cluster" "centers" "totss" "withinss"

## [5] "tot.withinss" "betweenss" "size" "iter"

## [9] "ifault"

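k-means clustering uses the kmeans() function. Setting nstart = 25 runs the algorithm from 25 random sets of initial centers and keeps the solution with the smallest total within-cluster sum of squares, which makes the result less sensitive to an unlucky starting configuration.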
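The effect of multiple starts can be sketched by hand: run k-means several times with one start each and keep the best run. This illustrates the idea rather than the internals of kmeans(), and it uses the built-in iris data as a stand-in; runs, tot and best are illustrative names.

```r
# Stand-in data (assumption): standardized numeric columns of iris.
x.scaled <- scale(iris[-5])

set.seed(1234)
# Run k-means from 25 different random starts, one start per call.
runs <- lapply(1:25, function(i) kmeans(x.scaled, centers = 3, nstart = 1))
tot  <- sapply(runs, function(fit) fit$tot.withinss)

# Keep the run with the smallest total within-cluster sum of squares.
best <- runs[[which.min(tot)]]

range(tot)          # spread of solutions across the starts
best$tot.withinss   # the solution that would be kept
```

If range(tot) shows a wide spread, single-start results are unstable for that data, and a larger nstart is worth the extra computation.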

5) Checking how well the clusters match the types

( ct.km <- table(fit.km$cluster, wine$Type) )

## A B C

## 1 59 3 0

## 2 0 65 0

## 3 0 3 48

6) Validating the result

library(flexclust)

randIndex(ct.km)

## ARI

## 0.897495

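The Adjusted Rand Index (ARI) quantifies the agreement between Type and cluster, corrected for chance. A value of 1 means perfect agreement, values near 0 mean agreement no better than chance, and negative values mean less agreement than chance would give. The value of about 0.90 here indicates close agreement between the clusters and the wine types.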
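To make the index less of a black box, the sketch below recomputes it in base R from the cross-tabulation printed in step 5, using the standard pair-counting formula. The helper adjusted_rand() is a hypothetical name written for this illustration, not part of flexclust.

```r
# Cross-tabulation from step 5 (rows: clusters, columns: wine types).
# matrix() fills column by column, so each line below is one type.
ct <- matrix(c(59,  0,  0,    # type A
                3, 65,  3,    # type B
                0,  0, 48),   # type C
             nrow = 3,
             dimnames = list(cluster = 1:3, type = c("A", "B", "C")))

# Hypothetical helper: ARI from a contingency table.
adjusted_rand <- function(tab) {
  pairs <- function(n) sum(choose(n, 2))   # pairs of items within each group
  index    <- pairs(tab)                   # pairs together in both partitions
  expected <- pairs(rowSums(tab)) * pairs(colSums(tab)) / choose(sum(tab), 2)
  maximum  <- (pairs(rowSums(tab)) + pairs(colSums(tab))) / 2
  (index - expected) / (maximum - expected)
}

ari <- adjusted_rand(ct)
ari
```

The result should agree with the randIndex() output above to the printed precision.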

6. PAM clustering

1) Loading the data

library(rattle)

data(wine)

wine$Type <- factor(wine$Type, levels=c(1,2,3), labels = c('A','B','C'))

str(wine)

## 'data.frame': 178 obs. of 14 variables:

## $ Type : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...

## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...

## $ Malic : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...

## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...

## $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...

## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...

## $ Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...

## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...

## $ Nonflavanoids : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...

## $ Proanthocyanins: num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...

## $ Color : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...

## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...

## $ Dilution : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...

## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

2) Scaling the data

wine.scaled <- scale(wine[-1])

head(wine.scaled)

## Alcohol Malic Ash Alcalinity Magnesium Phenols

## [1,] 1.5143408 -0.56066822 0.2313998 -1.1663032 1.90852151 0.8067217

## [2,] 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481

## [3,] 0.1963252 0.02117152 1.1062139 -0.2679823 0.08810981 0.8067217

## [4,] 1.6867914 -0.34583508 0.4865539 -0.8069748 0.92829983 2.4844372

## [5,] 0.2948684 0.22705328 1.8352256 0.4506745 1.27837900 0.8067217

## [6,] 1.4773871 -0.51591132 0.3043010 -1.2860793 0.85828399 1.5576991

## Flavanoids Nonflavanoids Proanthocyanins Color Hue

## [1,] 1.0319081 -0.6577078 1.2214385 0.2510088 0.3611585

## [2,] 0.7315653 -0.8184106 -0.5431887 -0.2924962 0.4049085

## [3,] 1.2121137 -0.4970050 2.1299594 0.2682629 0.3174085

## [4,] 1.4623994 -0.9791134 1.0292513 1.1827317 -0.4263410

## [5,] 0.6614853 0.2261576 0.4002753 -0.3183774 0.3611585

## [6,] 1.3622851 -0.1755994 0.6623487 0.7298108 0.4049085

## Dilution Proline

## [1,] 1.8427215 1.01015939

## [2,] 1.1103172 0.96252635

## [3,] 0.7863692 1.39122370

## [4,] 1.1807407 2.32800680

## [5,] 0.4483365 -0.03776747

## [6,] 0.3356589 2.23274072

The variables are measured in different units, so they are standardized before clustering.

3) Choosing the clustering algorithm and obtaining the final clustering

set.seed(1234)

library(cluster)

( fit.pam <- pam(wine[-1], k = 3, stand = TRUE) )

## Medoids:

## ID Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids

## [1,] 36 13.48 1.81 2.41 20.5 100 2.70 2.98

## [2,] 107 12.25 1.73 2.12 19.0 80 1.65 2.03

## [3,] 175 13.40 3.91 2.48 23.0 102 1.80 0.75

## Nonflavanoids Proanthocyanins Color Hue Dilution Proline

## [1,] 0.26 1.86 5.1 1.04 3.47 920

## [2,] 0.37 1.63 3.4 1.00 3.17 510

## [3,] 0.43 1.41 7.3 0.70 1.56 750

## Clustering vector:

## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 2 2 2 1

## [71] 2 1 2 1 1 2 2 2 1 1 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 2

## [106] 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 1 3 2 1 2 2 2 2 2 3 3 3 3 2 3 3 3 3 3

## [141] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

## [176] 3 3 3

## Objective function:

## build swap

## 3.593378 3.476783

## Available components:

## [1] "medoids" "id.med" "clustering" "objective" "isolation"

## [6] "clusinfo" "silinfo" "diss" "call" "data"

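The first argument of pam() is the data, k is the number of clusters, and stand is a logical value indicating whether the variables are standardized before the dissimilarities are computed. Because stand = TRUE is used, the raw data wine[-1] is passed rather than wine.scaled. Note that pam() standardizes by subtracting each variable's mean and dividing by its mean absolute deviation, not its standard deviation, so the result is not identical to scale().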
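One practical consequence of PAM is that each cluster center (medoid) is an actual observation, which is why the medoids above are printed with a row ID and in the original units. A small sketch on the built-in iris data (a stand-in, assumed here so that only the cluster package is needed) makes this visible:

```r
library(cluster)   # recommended package, included with most R installations

x <- iris[-5]      # stand-in data (assumption)
fit <- pam(x, k = 3, stand = TRUE)

# Row numbers of the observations chosen as medoids.
fit$id.med

# The medoids are those rows of the original data, in the original units.
fit$medoids
x[fit$id.med, ]
```

A k-means center, by contrast, is an average of the cluster members and usually coincides with no single observation.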

4) Visualization

clusplot(fit.pam, main="Bivariate Cluster Plot")

The bivariate plot is a scatter plot of the observations using the first two principal components as coordinates. Each cluster is drawn as the smallest ellipse that contains all of its observations.
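The coordinates behind such a plot can be approximated with a plain principal component analysis. The sketch below uses prcomp() on the built-in iris data (a stand-in); it is not the exact computation clusplot() performs, so the axes may differ in sign or scale from the plot above.

```r
x <- iris[-5]                     # stand-in data (assumption)
pc <- prcomp(x, scale. = TRUE)    # PCA on standardized variables
coords <- pc$x[, 1:2]             # scores on the first two components

# Share of the total variance captured by the two plotted components.
explained <- sum(pc$sdev[1:2]^2) / sum(pc$sdev^2)
explained

plot(coords, xlab = "Component 1", ylab = "Component 2",
     main = "Observations on the first two principal components")
```

When the explained share is low, clusters that overlap in the two-dimensional plot may still be well separated in the full variable space.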

5) Checking how well the clusters match the types

( ct.pam <- table(fit.pam$clustering, wine$Type) )

## A B C

## 1 59 16 0

## 2 0 53 1

## 3 0 2 47

6) Validating the result

library(flexclust)

randIndex(ct.pam)

## ARI

## 0.6994957

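The Adjusted Rand Index quantifies the agreement between Type and cluster, corrected for chance: 1 means perfect agreement, and values near 0 or below mean agreement no better than chance. The value here, about 0.70, is lower than the 0.90 obtained with k-means above, so on this data the PAM clusters agree less closely with the known types.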
