5.9 Clustering, Distance Methods, and Ordination

5.9.1 Overview

Example: Customer Segmentation

5.9.1.0.1 Clustering

군집화의 기준

동일한 군집에 속하는 개체 (또는 개인) 은 여러 속성이 비슷하고, 서로 다른 군집에 속한 관찰치는 그렇지 않도록 (여러 속성이 비슷하지 않도록) 군집을 구성

군집화를 위한 변수: 전체 개체 (개인) 의 속성을 판단하기 위한 기준
- 인구통계적 변인 (성별, 나이, 거주지, 직업, 소득, 교육 등)
- 구매패턴 변인 (상품, 주기, 거래액 등)

군집분석에서는 관측값들이 서로 얼마나 유사한지, 또는 유사하지 않은지를 측정할 수 있는 측도가 필요하다. - 군집분석에서는 보통 유사성(similarity)보다는 비유사성(dissimilarity)를 기준으로 하며, 거리(distance)를 사용한다.

$x$가 연속형일 때 CA의 위력이 최고로 발휘됨. 유사성의 척도로 거리가 사용되는데, 카테고리컬 변수에는 거리 계산이 불가능하기 때문. 꼭꼭 카테고리컬 변수로 CA를 해야겠다면 지시변수로 대체하여 CA를 시도할 수는 있겠으나, 이는 어느정도 억지로 하는 것이고 오점없는 CA는 아님.

5.9.1.0.2 Distance Measures

거리 (Distance) 라는 함수. CA에서 사용되는 모든 거리는 pairwise 거리.

Euclid 거리 (Euclidean) : 가장 메이저함

p차원 공간에서 주어진 두 점 $\pmb x=(x_1 , \cdots, x_p), \; \; \pmb y=(y_1 , \cdots, y_p)$ 사이의 유클리드 거리는

$ d(x, y) = $

if $p=2$,

{:start=“2”}

Minkowski 거리

$ d(x, y) = { {_{i=1}^p (x_i - y_i)^m} }^{} $

$m=2$일 때 이는 Euclidean과 같아진다. 보통은 m의 값으로 짝수를 많이 씀. 민코프는 결국 Euclidean의 일반화.

{:start=“3”}

Mahalanobis 거리

위에서 A와 B의 거리만을 보는 것이 아니라 위의 점들의 군집의 패턴 또한 고려함. x축과 y축에 해당하는 변수들 사이에 correlation이 있다는 것을 반영함. 중앙의 $S^{-1}$으로 corr 구조를 반영하는 것. 뭔 메커니즘으로? 위 케이스를 생각하면 A는 전체적인 패턴의 연장선 상에서 멀리 있는데, B는 패턴에서 직교해서 벗어나면서 가까이 있음. 따라서 A보다 B가 멀다고 평가 가능.

$ d(x, y) = $

{:start=“4”}

Manhattan 거리

$ d_{Manhattan} (x, y) = {_{i=1}^p x_i - y_i } $

Standardization

CA는 자료 사이의 거리를 이용하여 수행되기 때문에, 각 자료의 단위가 결과에 큰 영향을 미친다. 이러한 문제를 해결하기 위하여, 가장 널리 쓰이는 방법이 표준화 방법이다. 표준화 방법이란 각 변수의 관찰값으로부터 그 변수의 평균을 빼고, 그 변수의 표준편차로 나누는 것이다. 표준화된 모든 변수가 평균이 0이고 표준편차가 1이 된다. 사실상 필수.

Graphical Tools
- Scatter Plot
- Scatter Plot using PCA
- Andrews Plot
- Star Plot
- Chernoff Faces

5.9.2 Hierarchical Clustering

Start with $N$ clusters, each containing a single entity and an $N \times N$ symmetric matrix of distances, $D=\{d_{ik}\}$.
Search the distance Matrix $D$ for the nearest pair of clusters. Let the distance b/w the most similar (가장 거리가 작은) clusters $U$ and $V$ be $d_{UV}$.
Mearge clusters $U$ and $V$. Label the newly formed cluster $(UV)$. Update the entries in the distance Matrix $D$ by squences below. The distance b/w $(UV)$ and other cluster $W$ is denoted by $d_{(UV)W}$.
1. deleting rows and columns corresponding to clusters $U$ and $V$, then
2. adding a row and a column giving the distance b/w $(UV)$ and the remaining clusters.
repeat steps 2 and 3 a total of $N-1$ times. Then, all observations will be in single clusters. Record the identity of clusters that are merged and the levels at which the mergers take place.

5.9.2.0.1 계층적 군집분석 Example

distance Matrix $D$는 $n^2$에 의존하여 변수 숫자가 증가하면 연산 시간도 기하급수적으로 증가.

{:start=“5”} 5. 계층적 분석에서만 덴드로그램을 그릴 수 있음. a graphical tool to illustrate the merges or divisions.

python 라이브러리 함수 기준 총 distance의 70%에서 짤라서 clutser를 판정. color_threshold.

5.9.2.0.2 HCA의 종류

Single Linkage, 단일 연결 (mimum distance, or nearest neighbor)

$ d_{(UV)W} = ( d_{UW}, d_{VW} ) $

{:start=“2”} 2. Complete Linkage, 완전 연결 (maximum distance, or farthest neighbor)

$ d_{(UV)W} = ( d_{UW}, d_{VW} ) $

{:start=“3”} 3. Average Linkage, 평균 연결 (average distance) - 위의 둘이 변동이 너무 심해서 이를 해결하기 위해 제시됨

$ d_{(UV)W} = ( {i=1}^{n{UV}} {j=1}^{n{W}} d_{ij} ) $

{:start=“4”} 4. Centriod Method, 중심점 연결 (For each cluster, compute the centroid)

$ d_{(UV)W} = U V $

{:start=“5”} 5. ~~Ward’s Method~~

bold들이 무난함

5.9.2.0.3 HCA의 장단점

Advantage: - cluster의 수를 알 필요가 없음 - 덴드로그램 통해 군집화 프로세스와 결과물을 표현 가능

Disadvantage: - 계산속도가 느림 - 아웃라이어 (이상치) 가 존재할 경우, 초기 단계에 잘못 분류된 군집은 분석이 끝날때까지 소속 cluster가 변하지 않음 - 아웃라이어에 대한 사전검토 필요, Centroid 방법이 아웃라이어에 덜 민감함

5.9.3 K-means Clustering

K-평균 군집분석법. 사전에 결정된 군집수 $k$에 기초하여 전체 데이터를 상대적으로 유사한 k개의 군집으로 구분한다.

Proceeds: 1. 군집수 k를 결정한다 2. 초기 k개 군집의 중심을 선택한다 (랜덤하게) 3. 각 관찰치를 그 중심과 가장 가까운 거리에 있는 군집에 할당한다. 4. 형성된 군집의 중심 (centroid) 을 계산한다. 5. 3-4의 과정을 기존의 중심과 새로운 중심의 차이가 없을 때까지 반복한다.

5.9.3.0.1 Determination of K

KCA의 결과는 초기 군집수 k의 결정에 민감하게 반응한다.

여러가지의 k값을 선택하여 CA를 수행한 후 가장 좋다고 생각되는 k값을 이용.
- Elbow point 계산하여 k 선택
- Silhouette plot으로 k 선택
자료의 시각화를 통하여 K를 결정 (ex. star plot을 2차원 df로 바꾸어 평균 체크했었음)
- 자료의 시각화를 위해서는 차원축소가 필수적이고, 이를 위하여 PCA가 널리 사용된다.
대용량 데이터에서 sampling한 데이터 (이것이 스몰데이터가 됨) 로 HCA를 우선 수행하여 (여기서 덴드로그램이 얻어짐) k의 값을 선택 (즉 HCA와 KCA를 둘 다 쓰므로 hybrid)

5.9.4 군집의 평가방법

Silhouette Score (Silhouette Plot)

$ s(i) = = \[\begin{cases} 1-\dfrac{a(i)}{b(i)}, & if \; \; a(i) < b(i) \\ 0, & if \; \; a(i) = b(i) \\ \dfrac{b(i)}{a(i)} - 1, & if \; \; a(i) > b(i) \end{cases}\]

$a(i)$: 개체 $i$로부터 같은 군집 내에 있는 모든 다른 개체들 사이의 평균 거리. 작을수록 좋다. 작을수록 해당하는 군집 안에서 중앙 부분에 components가 모여 있다는 소리이므로.
$b(i)$: 개체 $i$로부터 다른 군집 내에 있는 개체들 사이의 평균 거리 중 가장 작은 값. 클수록 좋다. 클수록 다른 군집에 헷갈려서 속할 일 없이 확실하게 현재 소속되어 있는 군집에 소속되어 구분된다는 소리이므로.

1을 넘어갈 수 없으며, 1에 가까울수록 군집화가 잘 된 관찰값. 몇개의 cluster가 설정되었을 때 가장 해당 stat이 높게 나오는지를 통해 판정하는 것이 이 접근법. 평균 Silhouette Score는 모든 obs마다 $s(i)$를 구하여 이를 평균낸 값이므로, 평균 Silhouette Score가 1에 가까울수록 군집분석이 잘됐다고 판단 가능.

5.9.5 Clustering using Density Estimation (wk14)

Based on nonparametric density estimation The clusters may be viewed as high-density regions in the space separated by low-density regions between them. No need to specify the number of clusters. It is determined by the method itself.

밀도기반 추정에 요구되는 (hyper) Parameter: bandwidth. 해당 값이 달라지면 결과도 달라짐. Iris 데이터 예

5.9.5.0.1 Kernel Density Estimation (KDE)

$ f(x_0) = _{i=1}^N K ( ) , ; ; ; ; ; x R $

N은 샘플사이즈, 람다는 밴드위스, K는 스무딩 커널, x_i는 obs

closed form처럼 보이지만 그냥 상징적인 공식일 뿐. closed form이 있는게 아니라 데이터 포인트마다 고유한 값이 추정되는 것으로 진행됨.

밀도추정에서 가장 많이 쓰는 방법. 추정하고 싶은 포인트는 $x_0$. $x_0$라는 포인트에 대해 density를 추정하고 싶다. $x_0$ 인근의 관찰치는 더 많은 가중치를 가짐. $x_0$ 로 부터 멀어질수록 가중치는 감소함. 각 obs 별로 커널함수 부여하고 최종적으로 그 커널함수 다 더한 다음에 스케일링하면 끝.

K의 가장 흔한 선택은 정규분포함수, 즉 Gaussian Kernel

Bandwidth 의 효과: 커널함수의 좌우 넓이에 해당하는 것으로서, 가우시안 커널에서는 표준편차에 해당함. Bandwith가 크면 x값들 간에 차별화가 덜되어서 추정 위력이 떨어짐

봉우리의 갯수는 군집의 갯수로 생각할 수 있음. 지나치게 밴드위스가 좁으면 뾰족한 부분이 다수 튀어나와 군집의 과다추정 발생

그래프는 1차원 밀도 추정에 해당 회색: 정답. 표준정규분포 붉은색: undersmoothed, 𝜆𝜆 = 0.05 (too small) 녹색: oversmoothed, 𝜆𝜆 = 2 (too large) 검정색: optimally smoothed, 𝜆𝜆 = 0.337

Bandwidth 추정

2D Kernel Density Estimation: 2차원에서의 KDE는 어떻게 확장될 것인가?

5.9.6 Multidimensional Scaling (MDS)

Dimension Reduction Methods - PCA : x변수들끼리의 분산을 최대화시키는 방향으로 차원축소. 한 변수의 분산이 최대화되어야 함 - Factor: 변수간의 correlation을 최대한 깨트리지 않고 반영하는 방향으로 DR. Corr 구조가 최대한 유지 - MDS - Canonical Discriminant Analysis

이중 위의 둘은 original data의 Variance 설명에 집중함. (ex. 1명이 401호, 1명이 501호에 있다고 하면, 둘의 직선 거리가 그렇게 크게 떨어져있다고 하기는 어렵지만 위의 두 분석법은 멀리 떨어져 있는 것처럼 그래프에 표현될 수 있음. 거리 개념이 없기 땨문)

MDS
- Fit (projection) the original data into a low-dimensional coordinate system such that any distortion caused by a reduction in dimensionality is minimized.
- Map the distances between points in a high dimensional space into a lower dimensional space.
distortion이란? dissimilarity (distance) among the original data points
- For a given set of observed similarities (or distances) between every pair of N items, find a representative of the items in as few dimensions as possible such that the similarities (or distances) in the lower dimensions match, as close as possible with the original similarities (or distances).
1. Nonmetric MDS
- Only the rank orders of the N(N-1)/2 original similarities are used to arrange N items in a lower-dimensional coordinate system. 거리 없이 rank만 주어져있음. rank만 안무너지도록
1. Metric MDS (자주씀. Principal Coordinate Analysis)
- The actual magnitudes of the original similarities are used to obtain a geometric representation.

5.9.6.0.1 Kruskal’s Stress

so-called Badness of fit criterion. MDS가 잘됐다면 기존 오리지널 차원의 거리나 차원축소된 이후의 거리나 비슷해야 함. 크루스칼 스트레스가 작으면 왜곡도 작은 것. 스트레스가 최소인 DR이 최고의 DR.

Let $D_{rs}$ denote the actual distance (or dissimilarity) between item r and item s, then the ordered distances are $D_{r_1 s_1 } <D_{r_2 s_2 } < < D_{r_M s_M }, ; ; ; M=
\[\begin{pmatrix} N \\ 2 \end{pmatrix}\]
$.
Let $d_{rs}$ denote the distance between item r and item s in the lower dimensional space.
MDS seeks (iteratively) to find a set of $d$’s such that $d_{r_1 s_1 } <d_{r_2 s_2 } < \cdots < d_{r_M s_M }$ and $Stress = \left\{ \dfrac{\sum_{i=1}^N \sum_{j=1}^{i-1}(D_{ij} - d_{ij})^2} {\sum_{i=1}^N \sum_{j=1}^{i-1} \left( D_{ij} \right)^2} \right\}^{\tfrac{1}{2}}$ is minimized.
Interpretation Guideline

Stress	Goodness of Fit
20%	Poor
10%	Fair
5%	Good
2.5%	Excellent
0%	Perfect

Goodness of fit = monotonic relationship between the similarities and the final distances.

Takane’s Stress

Stress = { {{i=1}^N {j=1}^{i-1}(D_{ij}2)^2} }^{}

Algorithm: 1. For N items, obtain $M=\dfrac{N(N-1)}{2}$ 개의 distances $D_{r_1 s_1 }, D_{r_2 s_2 } , \cdots , D_{r_M s_M }$. Tehn an $N \times N$ matrix $D = \{D_{ij} \}$ is constructed.

Using a trial configuration in q dimensions, determine distances $d_{ij}^{(q)}$. The method to get initial $d_{ij}^{(q)}$ is given later.
Using the $d_{ij}^{(q)}$, move the points around to obtain an improved configurations.
A new configuration: new $d_{ij}^{(q)}$ and smaller stress (e.g. Newton-Raphson method) The process is repeated until the best (minimum stress) representation is obtained.
Plot minimum stress (q) versus q and choose the best number of dimensions, $q^\ast$ from an examination of this plot. x축은 축소된 차원, y축은 stress. 차원이 작아질수록 Stress는 높고, 차원이 p라면 (original 차원과 같다면) Stress는 0. PCA와 달리 여기서는 elbow에서 멈춤.
- similar to scree plot

Note: 1. The larger the dimension, the better the fit. 2. Higher dimension means harder to visualize.

5.9.6.0.2 Algorithm to find 초기값 $d_{ij}^{(q)}$

q값을 줄이려면 수치해석을 시작하기 전에 넣어줄 초기값에 해당하는 초기좌표들이 필요함. 그 값을 구하는 방법.

Construct the $N \times N$ matrix $A = \{ a_{ij} \} = \left\{ -\dfrac{1}{2} D_{ij}^2 \right\}$.
Construct the $N \times N$ matrix $B = \left(I - \dfrac{1}{N} J \right) A \left(I - \dfrac{1}{N} J \right) = \{ b_{ij} \} = \{ \bar a_{ij} - \bar a_{i.} - \bar a_{.j} + \bar a_{..} \}$.

where $ a_{..} = ^N ^N , ; ; ; ; ; J = \[\begin{bmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{bmatrix}\]

{:start=“3”}

D행렬은 distance들의 SSE 행렬 정도에 해당.

Since $B$ is a symmetric matrix, use the spectral decomposition to write $B$ in the $B = V \Lambda V'$.
If B is positive semidefinite of rank q (p차원 아님!! $q \le p$. 거리행렬이 일정 ev까지는 유의할 수 있는데 그 후로는 0만 튀어나올 수 있으며 DR은 바로 이상황에서 일어남. p는 위에서 보였던 유사 scree plot에서 original data의 차원으로 지정되었던 숫자) , there are q positive eigenvalues.

_1 = \[\begin{bmatrix} \lambda_1 & \cdots & \pmb 0 \\ & \ddots & \\ \pmb 0 & \cdots & \lambda_q \end{bmatrix}\] _{q q}, ; ; ; ; ; V_1 = \[\begin{bmatrix} \pmb v_1 , \pmb v_2 , \cdots, \pmb v_q \end{bmatrix}\]

_{N q}

then we can express

B = { V_1 }{N q} { 1 }{q q} { V_1 ’ }{q N}

= V_1 _1^{1/2} _1^{1/2} V_1 ’ = ZZ’

where

Z = V_1 _1^{1/2}

= \[\begin{bmatrix} \sqrt{\lambda_1} \pmb v_1 , \sqrt{\lambda_2} \pmb v_2 , \cdots, \sqrt{\lambda_q} \pmb v_q \end{bmatrix}\] = \[\begin{bmatrix} \pmb z_1 ' \\ \pmb z_2 ' \\ \vdots \\ \pmb z_q ' \end{bmatrix}\]

_{N q}

{:start=“4”}

The rows $z_1 ’ , z_2 ’ , , z_q $ of $Z$ are the points whose interpoint distance $d_{ij}^{(q)} = (\pmb z_i - \pmb z_j)'(\pmb z_i - \pmb z_j)$ match $D_{ij} $s in the original distance matrix $D$.
Since $q$ will typically be too large to be of practical interest and we would prefer a smaller dimension $k$ for plotting, we can use the first $k$ eigenvalues and corresponding eigenvectors to obtain $N$ points whose distances $d_{ij}^{(k)}$ are approximately equal to the corresponding $D_{ij}$. 오리지널 데이터의 차원 p가 15개였다면, 이 차원을 SVD했을 때 ev 중 5개가 0이어서 q는 15개로 하였다. 여기서 차원을 더 줄이고 싶다면, 가령 k=5개까지 임의로 내려버리고 싶다면, 뒤쪽의 ev 10개에 해당하는 걸 쳐내는 것

Rank is clearly rank 2. 즉 차원을 2차원까지 줄여도 손실되는 정보가 전혀 없다. 따라서 orginal data Distance Matrix에서 보였던 특성이 그대로 똑같이 드러난다.

Self-Study