7.2 Descriptive Statistics of Networks

complex system 에 대한 연구에서 연구하는 문제는 대응하는 네트워크 그래프의 구조, 혹은 특성을 분석하는 문제로 동치될 수 있음. 3개의 vertex 들을 묶어 특정 형태의 triplet 을 만들어 triplet 의 특성을 분석. 상품이나 정보의 흐름 분석은 네트워크 분석에서의 path 발생 혹은 비발생 확인 문제와 동치. 각각의 시스템에서 해당 element 의 중요도를 체크하는건 vertex 의 centrality 확인과 동치. 동계통의 community 혹은 그룹을 찾는 문제는 그래프 partitioning 문제와 동치. 이런 네트워크 분석은 순혈 통계와는 살짝 차이가 있음. 보통 수학과 컴퓨터과학, 사회구조 분석사회학, 물리학 등에 의존함.

7.2.1 Vertex and Edge Characteristics

네트워크의 기본 요소는 edge 와 vertex. 이들을 characterization 하고자 하는 작업은 vertex degree 에 기반하며, 이는 각 vertex 가 얼마나 중요한지를 판단하기 위한 측도를 획득하기 위함.

7.2.1.1 Vertex Degree

네트워크 그래프 $G=(V, E)$ 에서 vertex $v$ 의 degree $d_v$ 는 $v$ 에 엮인 edge 의 갯수. $\forall v \in V, d_v = d: f_d \coloneqq \frac{v}{d = d_v}$. collection $\{f_d\}_{d \ge 0}$ 는 $G$ 의 degree distribution 이라고 부른다. 이는 결국 원본 degree sequence 를 rescaling 한 것.

library(sand)

data(karate)
par(mfrow = c(1, 2))
hist(degree(karate), col = "lightblue", xlim = c(0, 50), xlab = "Vertex Degree",
    ylab = "Frequency", main = "")
hist(graph.strength(karate), col = "pink", xlab = "Vertex Strength",
    ylab = "Frequency", main = "")

FIGURE 7.1: Karate Network

weighted network 케이스에서 유용하게 자주 쓰이는 degree 의 일반화는 vertex strength. 이는 해당 vertex 에 연결된 edge 의 weight 를 전부 합한 것. 이러한 strength 의 distribution 은 weighted degree distribution 이라고 불리며, 이는 일반적인 degree distribution 과 입지가 같음.

Figure 2: The vertex strength distribution for the Karate club network - A Network of Interactions among Protein Pairs in Yeast

library(igraphdata)

data(yeast)
ecount(yeast)
## [1] 11855
vcount(yeast)
## [1] 2617

히스토그램을 확인해보자. degree 가 낮은 substantial fraction of vertex 들이 존재한다. 이들의 magnitude 는 karate network 의 그것이랑 유사하지만, 이와 동시에 연속적으로 higher order of magnitude 를 가지는 vertex 들의 숫자가 non-trivial 하게 관측된다. 로그화 시킨 degree 의 경우, log frequency 에서 상당한 linear decay 관측 가능.


par(mfrow = c(1, 2))

d.yeast = degree(yeast)
hist(d.yeast, col = "blue", xlab = "Degree", ylab = "Frequency",
    main = "Degree Distribution")

dd.yeast = degree.distribution(yeast)
d = 1:max(d.yeast) - 1
ind = (dd.yeast != 0)
plot(d[ind], dd.yeast[ind], log = "xy", col = "blue", xlab = c("Log-Degree"),
    ylab = c("Log-Intensity"), main = "Log-Log Degree Distribution")

FIGURE 7.2: The degree distribution for protein interactions in Yeast

Figure 3: The degree distribution for protein interactions in Yeast

서로 다른 degree 의 vertex 들이 서로 연결되어 있다면 그 기저에 깔린 메커니즘은 무엇인가? 이 또한 흥미로운 부분. 이 문제의 해결에 주효하게 작용하는 개념은 주어진 vertex 의 average degree of the neighbors. 높은 degree 의 vertex 는 높은 degree 랑만 붙는 경향이 있는 반면 lower degree 들은 높든 낮든 들러붙는 경향.

# par(mfrow=c(1,1))

a.nn.deg.yeast = graph.knn(yeast, V(yeast))$knn
plot(d.yeast, a.nn.deg.yeast, log = "xy", col = "goldenrod",
    xlab = c("Log Vertex Degree"), ylab = c("Log Average Neighbor Degree"))

FIGURE 7.3: The degree distribution for protein interactions in Yeast

7.2.1.2 Vertex Centrality

vertex 에 대해 갖는 많은 의문은 결국 해당 vertex 가 주어진 네트워크에서 얼마나 중요한가 를 알기 위한 것.

centrality 에 대한 많은 측도 (measure) 들은 이러한 중요성을 측정하기 위해 개발되었음. vertex centrality 를 확인하기 위해 가장 자주 쓰이는 measure 는 vertex degree. 이외에 vertex centrality measure 로서 사용되는 건 Closeness, Betweenness, and Eigenvector 등이 존재. 이 셋이 좀 메이저, 마이저.

vertex centrality 를 나타내는 가장 직관적인 방법은 radial layout 을 쓰는 것. 이는 곧 central vertex 를 중앙에 가깝게 배치하는 것. 물론 네트워크가 너무너무 커버리면 표시불가. 작거나 중간크기 네트워크에만 사용가능.

Closeness centrality

vertex 가 다른 다수의 vertex 와 가깝다면 이를 central 이라고 판정. 일반적으로 사용되는 기준값은:

$$ \[\begin{alignat}{2} &C_{CL} &&=\frac{1}{\sum\limits_{u \in V} dist(u,v)} \tag{closeness} \\ 0 \le & &&=\frac{1}{\sum\limits_{u \in V} dist(u,v)} \cdot (N_v-1) \le 1 \tag{normalized} \end{alignat}\] $$

denominator $dist(v, u)$ 는 $u, v \in V$ 인 vertex $u, v$ 사이의 geodesic distance
- 이의 sum 은 결국 total distance of a vertex from all others.

각각 다른 그래프에서 산출된 centrality measure 를 비교하기 위해 normalize 하는 상황 있으며 위의 factor 곱하면 $[0,1]$ 로 normalize.

large centrality value 는 곧 small total distance 로 이어짐.

Betweenness centrality

어떤 vertex 가 다른 vertex 쌍 사이에 위치하고 있는지를 확인. 이건 vertex 가 네트워크 그래프의 path 에 비추어서 어디에 위치하고 있는지가 중요하다는 관점에 기반. 현실세계에 비추어도 이러한 path 가 인간관계라고 생각한다면 path 가 다수 지나가는 vertex, 즉 인싸는 중요한 사람일 것. 일반적으로 사용되는 값은:

\[ C_B (v) = \sum\limits_{s \not = t \not = v \in V} \frac{\sigma(s,t | v)}{\sigma(s, t)} \]

numerator 는 $v$ 를 통과하면서 $s,t$ 사이가 최단거리인 path 의 총 숫자,
denominator 는 이런 조건 없이 $s,t$ 사이가 최단거리인 path 의 총 숫자.

이를 unit interval 로 scale 할 때는 $\frac{(N_v - 1) (N_v - 2)}{2} = \choose {N_v - 1}2$ 로 나누면 됨.

eigenvector centrality

‘status’, ‘prestige’, ‘rank’ 등에 기반한 사고방식. vertex 의 이웃이 central 하다면, 본인 vertex 자체도 central 하리라는 관점. 이는 central 의 정의만 생각해보더라도 꽤 합리적인 추론임. 이는 적절하게 정의된 방정식의 linear system 의 eigenvalue solution 으로 표현가능함. 이러한 eigenvalue centrality 관점에 쓰이는 값은 꽤 많은데 가장 대표적인 건:

\[ C_{E_i} (v) = \alpha \sum\limits_{\{u,v\}\in E} C_{E_i}(u) \]

What is this?
1. we implement eigen-analysis for an adjacency matrix - $\alpha$: precision
2. find the largest eigenvalue and its corresponding eigenvector - eigenvalue: the importance capture largest variance - eigenvector: dependency information
vector $C_{E_i} = \left ( C_{E_i}(1), \cdots, C_{E_i}(N_v) \right)'$ 는 eigenvalue problem $A_{C_{E_i}}$ 의 solution.
$A$ 는 네트워크 그래프 $G$ 의 adjacency 매트릭스.

$\alpha^{-1}$ 의 optimal choice 는 $A$ 의 가장 큰 eigenvalue. 따라서 $h_{E_i}$ 는 상응하는 eigenvector. $G$ 가 undirected 이며 connected 라면, $A$ 의 largest eigenvalue 는 간단하며 이의 eigenvector는 모두 nonzero entry 이며 부호 같음. 관례 (convention) 적으로는 이 entry 들의 abs 를 보고함. 이러면 eigenvector 의 orthonormality 에 의해 얘들은 자동적으로 $[0,1]$ 사이로 scaled.

library(network)
library(sna)

A = get.adjacency(karate, sparse = FALSE)
g = as.network.matrix(A)

par(mfrow = c(2, 2))

gplot.target(g, degree(g), main = "Degree", circ.lab = FALSE,
    circ.col = "skyblue", usearrows = FALSE, vertex.col = c("blue",
        rep("red", 32), "yellow"), edge.col = "darkgray")
gplot.target(g, closeness(g), main = "Closeness", circ.lab = FALSE,
    circ.col = "skyblue", usearrows = FALSE, vertex.col = c("blue",
        rep("red", 32), "yellow"), edge.col = "darkgray")
gplot.target(g, betweenness(g), main = "Betweenness", circ.lab = FALSE,
    circ.col = "skyblue", usearrows = FALSE, vertex.col = c("blue",
        rep("red", 32), "yellow"), edge.col = "darkgray")
gplot.target(g, eigenvaluecent(g), main = "Eigenvalue", circ.lab = FALSE,
    circ.col = "skyblue", usearrows = FALSE, vertex.col = c("blue",
        rep("red", 32), "yellow"), edge.col = "darkgray")

이러한 centrality measure 를 undirected 에만 적용해왔음. directed 에도 적용못할 이유는 없지.

소위 hub vertex 의 중요성을 정의해보자. 얼마나 많은 authority vertex 를 그들이 향하는지, 그리고 얼마나 많은 authority vertex 들이 해당 vertex 를 향하는지를 통해 판단. directed graph $A$ 가 주어졌을 때 hub 는 $M_{hub} = AA'$ 의 eigenvector centrality 에 의해 결정됨. authority 는 $M_{auth} = A'A$ 에 의해 결정.

l = layout.kamada.kawai(aidsblog)
par(mfrow = c(1, 2))
plot(aidsblog, layout = l, main = "Hubs", vertex.label = "",
    vertex.size = 10 * sqrt(hub.score(aidsblog)$vector))
plot(aidsblog, layout = l, main = "Authorities", vertex.label = "",
    vertex.size = 10 * sqrt(authority.score(aidsblog)$vector))

FIGURE 7.4: AIDS blog network with vertex area proportional to hubs and authority centrality measures

7.2.1.3 Characterizing Edges

vertex betweenness centrality 에서 edge betweenness centrality 는 직관적. 각 edge 에 해당 edge 가 최단거리 path 로 삼아지는 횟수를 고르면 됨.

eb = edge.betweenness(karate)
E(karate)[order(eb, decreasing = T)[1:3]]
## + 3/78 edges from 4b458a1 (vertex names):
## [1] Actor 20--John A   Mr Hi   --Actor 20 Mr Hi   --Actor 32

하지만 이외의 vertex centrality measures 들은 edge 에 적용하려면 그렇게 쉽진 않음. 해결책 중 하나는 네트워크 그래프 $G$ 의 line graph 의 vertex 에 vertex centrality measures 를 적용하는 것. $G$ 의 라인그래프 $G'=(V', E')$ 는 $G$ 의 vertex 를 edge 로, edge 를 vertex 로 바꾸는 것으로 획득됨. vertex $v' \in V'$ 는 원본 그래프의 edge $e \in E'$ 를 의미하며, edge $e' \in E'$ 는 $G$ 에서의 대응하는 원본 vertex 1개의 세트를 의미함.

7.2.2 Characterizing Network Cohesion

• 네트워크 그래프 상에서 edge 로서 정의된 관계에 비추었을 때, vertex 들의 어느 subset 이 어느 정도로 응집되는가, 혹은 같이 들러붙는가 하는 문제에 대해 생각해보는 것이 network cohesion. SNS 에서 친구의 친구끼리는 친구가 되기 쉬운가? 세포에서 어느 protein 들끼리 협업할 가능성이 높은가? 인터넷 토폴로지에서 어느 부분이 “backbone” 을 구성하는가? 다루는 문제가 무엇이냐에 따라 network cohesion 그 자체가 무엇인지 또한 달라짐. local 에서 global (ex. giant component), 어느 정도로 확실하게 정의되는가 (ex. clique) 혹은 두루뭉술한가 (ex. cluster, community) 등 스탯이 다양함.

7.2.2.1 Subgraphs and Censuses

Cliques

complete 서브그래프, 즉 fully cohesive 한 vertex 들의 subset 을 Cliques 라고 통칭. 이인즉 소속된 모든 vertex 들이 edge 로 연결되어 있다는 것. 모든 사이즈의 clique 들에 대한 census 는 그래프가 어떻게 structure 되어 있는지에 대한 ‘snapshot’ 을 제공함.

더 큰 사이즈의 clique 는 필연적으로 작은 사이즈의 cluque 들을 포함함. maximal clique 는 더 큰 clique 의 subset 이 아닌 clique 를 일컫음. 큰 clique 는 본질적으로 드뭄. 큰 clique 의 존재는 결국 원본 그래프 $G$ 가 일정 이상으로 dense 할 것을 요구하니까. 하지만 현실세계 네트워크는 보통 sparse 하거든.

clique size: 1 (node), 2 (edge), 3 (triangle), 4 (안이 크로스된 사각형)

table(sapply(cliques(karate), length))
## 
##  1  2  3  4  5 
## 34 78 45 11  2
cliques(karate)[sapply(cliques(karate), length) == 5]
## [[1]]
## + 5/34 vertices, named, from 4b458a1:
## [1] Mr Hi    Actor 2  Actor 3  Actor 4  Actor 14
## 
## [[2]]
## + 5/34 vertices, named, from 4b458a1:
## [1] Mr Hi   Actor 2 Actor 3 Actor 4 Actor 8
table(sapply(maximal.cliques(karate), length))
## 
##  2  3  4  5 
## 11 21  2  2
clique.number(yeast)
## [1] 23

$k$-core

clique 를 약화시킨 개념. 그래프 $G$ 의 $k$-core 는 $G$ 의 서브그래프 중 모든 vertex degree 가 최소한 $k$ 는 되는 것 서브그래프들 중에서도, 다른 서브그래프들이 $k$-core 와 동일한 condition 을 따르지 않는 것을 $k$-core 라고 말함. 즉, 해당 성질을 보유한 서브그래프들 중 maximal 한 놈. core 라는 개념은 특히 visualization 쪽에서 핫함. 네트워크를 ‘layer’ 로 decomposition 할 방법론을 제공하기 때문.


cores = graph.coreness(karate)
gplot.target(g, cores, circ.lab = FALSE, circ.col = "skyblue",
    usearrows = FALSE, vertex.col = cores, edge.col = "darkgray")

FIGURE 7.5: Visual representation of the k-core decomposition of the karate network

detach("package:sna")
detach("package:network")

core 에 포함된 vertex 들은 center 에서 크게 떨어지지 않았으며, 각 core 에서 일정한 거리를 유지하고 있음이 시각적으로 확인 가능.

Network Cohesion 을 정의함에 있어서 쓰이는 다른 서브그래프들의 class

vertex 2개를 뽑아 만든 쌍 (pair) 를 dyad 라고 부름. directed 그래프에서 dyad 는 3개의 상태를 가질 수 있다.

null (no directed edges)
asymmetric (one directed edge)
mutual (two directed edges)

대부분의 dyad 는 보통 null 이며, non-null 중에서도 대부분은 aymmetric 이다. 후자의 경우는 블로그에서 서로이웃이 아니라 한쪽만 팔로잉하는 경우겠지.

vertex 3개를 뽑아 만든 모음은 Triad. 이는 16개의 상태를 가질 수 있음. null 서브그래프부터, triad 에 속한 모든 vertex 들이 mutual directed edge 를 보유하는 서브그래프 까지.

aidsblog = simplify(aidsblog)
dyad.census(aidsblog)
## $mut
## [1] 3
## 
## $asym
## [1] 177
## 
## $null
## [1] 10405

해당 데이터를 살펴보면 hub 와 authority 에 대한 기존 지식과 궤를 같이함을 확인 가능. Small connected subgraphs of interest are commonly termed motifs. motif 라는 개념은 생물 네트워크에서 두드러지게 유명한 개념, 생태계 substructure 를 biological function 과 연결지을때 자주 쓰임.

7.2.2.2 Density and Related Notions of Relative Frequency

Density

그래프의 Density 는 potential, 즉 잠재적으로 발생할 edge 대비 실제로 발생한 edge 들의 빈도. 예를 들어 (undirected) 그래프 $G$ 가 self-loop 가 없고 multiple edge 도 없다고 할 때, 서브그래프 $H=(V_H, E_H)$ 의 density 는

\[ \begin{align} 0 \le den(H) &= \frac{|E_H|}{\frac{|V_H|(|V_H|-1)}{2}}= \frac{\text{# of edges}}{\text{# of dyads}} \le 1 \tag{undirected} \\ &= \frac{\phantom{|E_H|}}{|V_H|(|V_H|-1)} \tag{directed} \begin{alignat} \]

해당 값은 $[0,1]$ 에 존재하며 $H$ 가 clique 가 되기까지의 역치에 얼마나 가까운지에 대한 측도 (measure) 를 제공함.

$H=G$ 인 상황이라면 전체 그래프 $G$ 에 대한 density 를 생산. 반대로 vertex $v \in V$ 의 neighbor 의 set $H_v=H$ 가 되게 한다면, 이들 사이의 edge 는 $v$ 의 immediate 이웃의 density 의 측도 (measure) 을 생산함. immediate 이웃의 합집합으로만 생산한 ego-centric 네트워크 는 원본의 overall 네트워크보다 명백히 dense 함.

ego.instr = induced.subgraph(karate, neighborhood(karate, 1,
    1)[[1]])
ego.admin = induced.subgraph(karate, neighborhood(karate, 1,
    34)[[1]])
graph.density(karate)
graph.density(ego.instr)
graph.density(ego.admin)

Clustering Coefficients (transitivity)

일반적으로 이하를 일컫는다.

\[ cl_T (G) = \frac{3\tau_\Delta (G)}{\tau_3 (G)} \]

$\tau_\Delta (G)$: 그래프 $G$ 안에 있는 모든 triangle 의 숫자
$\tau_3 (G)$: connected triple, 즉 3개의 vertex 에 2개의 edge 가 놓여있는 (i.e., 2-star) vertex 들로 만든 서브그래프의 숫자.

이 Clustering Coefficients 인 $cl_T(g)$ 는 그래프의 transitivity 라고 불리기도 함. 이는 소셜 네트워크 문헌에서 일반적으로 관심을 갖는 변량 중 하나임. 다른 말로 fraction of transitive triples 라고도 불림. $cl_T(g)$ 는 global clustering 의 measure 이며, connected triple 이 triangle 을 형성하기까지에 얼마나 가까운지에 대한 상대적 빈도를 서술함.

transitivity(karate)
## [1] 0.2556818
transitivity(karate, "local", vids = c(1, 34))
## [1] 0.1500000 0.1102941

Reciprocity

directed graph 에 한정된 개념. reciprocated (mutual) 한 edge 의 숫자를 총 edge 의 숫자로 나눈 것. 이는 single, unreciprocated 한 dyad 대비 reciprocated dyad 의 비중을 나타냄.

reciprocity(aidsblog, mode = "default")
## [1] 0.03278689
reciprocity(aidsblog, mode = "ratio")
## [1] 0.01666667

7.2.2.3 Connectivity, Cuts, and Flows

기본적으로 궁금한 건 주어진 그래프가 서로 다른 서브그래프로 쪼개질 수 있나 하는 것. 불가능하다면 해당 그래프가 이 쪼개질 수 있는 성질의 역치에 얼마나 가까운지를 체크하는 것이 목적이 된다.

만약 모든 vertex가 다른 모든 vertex에서 접근 가능하다면, 즉 adjacency Matrix가 diag 제외하고 모두 1이면, 그래프 $G$는 connected라고 칭해진다.

그리고 그래프의 connected component는 maximally connected 서브그래프이다.

그래프 $G$의 connected component 중 하나가 다른 모두를 위력에서 압도한다면, 이는 곧 해당 connected component가 $G$의 대부분의 vertex를 포함하고 있다는 이야기. 이러한 component는 giant component라고 불리며 이는 random graph theory 출신 용어.

is.connected(yeast)
## [1] FALSE
comps = decompose.graph(yeast)
table(sapply(comps, vcount))
## 
##    2    3    4    5    6    7 2375 
##   63   13    5    6    1    3    1

결과는 false로 나오지만 이에 대해 census 돌리면 giant component의 존재 확인 가능. 아래 예시의 경우 component 1개가 2375/2617로 90퍼 vertex랑 연결중임. 이는 현실 네트워크에서의 small world property와 연결. vertex 쌍들 collection에서의 minimum path는 보통 되게 작음. 대비되게 clustring은 상대적으로 높음. (ex) protein?

small world property: high connectivity b/w pairs of nodes
- small shortest-path distance
- high clustering coefficient

yeast.gc = decompose.graph(yeast)[[1]]
average.path.length(yeast.gc)
## [1] 5.09597
diameter(yeast.gc)
## [1] 15
transitivity(yeast.gc)
## [1] 0.4686663

해당 네트워크에서의 shortest path는 $N_v$보다 $\log N_v$로 표현되는게 정확할 정도로 짧음. scales more like, thus considered small. 동시에 해당 네트워크에서의 clustering은 상대적으로 large, 이는 transitivity로 확인 가능.

Connectivity
1. $k$-vertex-connected
  - the number of vertices $N_v > k$
  - cardinality $|X|<k$ 이며 $X \subseteq V$ 인 vertex의 subset $X$ 을 지우면 connected subgraph가 아니게 됨.
2. $k$-edge-connected
  - $N_v ≥ 2$
  - cardinality $|Y|<k$이며 $Y \subseteq E$인 edge의 subset $Y$을 지우면 connected subgraph가 아니게 됨.

위의 조건에 따라 그래프 $G$ 는 $k$-vertex-connected 혹은 $k$-edge-connected.

즉 $G$의 vertex (edge) connectivity는 $G$의 k-vertex(k-edge-) connected가 유지되는 가장 큰 integer. 이때 vertex connectivity $\le$ edge connectivity $\le$ minimum degree among vertex in $G$ (dmin). 따라서 이 서브그래프를 추가적인 component로 분해하기 위해서는 단 1개의 엄선된 vertex나 edge를 제거하는 것으로 충분하다.

vertex.connectivity(yeast.gc)
## [1] 1
edge.connectivity(yeast.gc)
## [1] 1

Cut

vertex (edge)의 subset $S$를 제거하는 것으로 해당 그래프가 서브그래프로 조각난다면, $S$는 vertex-cut (edge-cut). 여기서 vertex $S$의 원소가 1개라면, 즉 vertex 1개만을 제거한 것으로 그래프가 조각났다면, 이는 cut vertex, 혹은 articulation point. 이러한 vertex의 여부를 식별하는 건 해당 네트워크가 외부 공격에 취약하는지를 파악하는데 도움이 됨. 해당 포인트 끊기면 네트워크 정상작동이 안되니까.

Identification of such vertices can provide a sense of where a network is vulnerable (e.g., in the sense of an attack, where disconnecting produces undesired consequences, such as a power outage in an energy network).
In the giant component of the yeast network, almost 15% of the vertices are cut vertices.

yeast.cut.vertices = articulation.points(yeast.gc)
length(yeast.cut.vertices)
## [1] 350

nontrivial 그래프 $G$는 $k$-vertex (k-edge) connected $\iff$ 서로다른 vertex의 쌍 $u, v \in V$가 $k$ vertex-disjoint (edge-disjoint) paths에 의해 connected 가능.

이 결과는 그래프에서 특정 vertex (edge)가 제거된 상황에서도 그래프 내부에서 만들어지는 서로 다른 path 들이 얼마나 많은지를 통해 평가되는 그래프의 robust함과 연결되어 있다. 낮은 vertex (edge) connectivity 를 가지는 그래프는 따라서 path들을 가질 수 있으며, 이에 의해 그 path들을 통과했던 “information”들은 작은 숫자의 vertex (edge)를 없애는 것만으로 쉽게 방해되고 만다.

shortest.paths()
graph.maxfow()
graph.mincu()

7.2.3 Graph Partitioning

Partitioning은 elements의 집합을 “발생이 자연스러운” 부분집합으로 분할하는 과정. 더 이론적으로 말하자면, finite set $S$의 partition $C = \{ C_1, \cdots, C_K \}$는 $S$를 $K$ 개의 disjoint로 decomposition 한 물건으로, 이인즉 $\forall C_k \not = \emptyset: \cup_{k=1}^K C_k = S$.

네트워크 그래프 분석에서, partitioning은 겉으로 드러나지 않는 관계성 측면에서 vertex의 묶음이 cohesiveness를 가지고 있는지를 확인하기에 유용한 방법이다. vertex의 cohesive한 subset은 일반적으로 이하와 같은 걸 일컬음:

subset 내부에서, 동시에, 잘 connected 되어 있어야 한다
subset 외부, 즉 남아있는 vertex들과 잘 seperated - 연결성이 없음

Graph partitioning algorithms 은 보통 그래프 $G(V, E)$의 vertex set $V$ 의 partition $C = \{ C_1, \cdots, C_K \}$를 찾는 것을 그 목표로 함. 이를 위한 방법으로 $C_k$ 안의 vertex에서 $C_k'$로의 vertex로 잇는 edge의 sets $E(C_k, C_k ')$는 $C_k$ 내에서 vertex 를 잇는 edge들의 set $E(C_k) = E(C_k , C_k)$보다 작다는 점을 활용함.

그래프 partitioning의 이 문제는 complex networks 문헌에서의 community detection에서도 동일하게 발생함. 이에 대한 해결책으로 큰 틀에서 2가지 접근법 이 존재.

7.2.3.1 Hierarchical Clustering

그래프 파티셔닝에 사용되는 대부분의 방법은 본질적으로 Hierarchical Clustering의 변용에 불과함. 여러가지 방법론이 제시되었지만, 그 차이는 결국 이하가 다를 뿐임.

proposed clusterings의 quality를 어떻게 측정하는가
연구자가 찾고 있는 해당 quality를 어떻게 최적화하는가. 보통 greedy algorithm 으로 모든 가능한 partition $C$의 space를 탐색하는 식으로 한다. 이 과정에서 계속해서 후보 partition을 갱신하고.

Hierarchical methods 는 다음 둘로 분류됨.

agglomerative, 파티션을 합쳐나가는 것을 계속해나가는 것으로 크기를 키워가는 것에 기반 (coarsen)
divisive, 파티션을 쪼개나가는 것을 계속해나가는 것으로 연속으로 다듬어나가는 것

각 단계에서 현재의 후보 partition은 지정된 비용 측정값을 최소화한다는 목적으로 계속해서 정제되어 갑니다.

agglomerative 방법에서는, 2개의 이전의 partition elements 중 가장 저렴한 merge 방법이 실행된다
divisive 방법에서는, 1개의 이전의 partition 중 가장 저렴하게 2개로 split 할 수 있는 방법이 실행된다

비용측정의 기준은 vertex의 cohesive subset을 뭘 기준으로 판정할지 하는 연구자의 주관이 개입됨. 메이저한 기준은 modularity. 계산은 이하와 같다:

$C = \{ C_1, \cdots, C_K \}$를 주어진 (given) 후보 (candidate) partition 으로 하자
$f_{ij} = f_{ij}(\mathcal C)$는 $C_i$ 에 있던 vertex 들을 $C_j$ 에 있는 vertex 들과 연결시키는 (오리지널 네트워크의) edge 들의 fraction

이때 $\mathcal C$의 modularity는

\[ \mod(\mathcal C) = \sum_{k=1}^K \left[ f_{kk}(\mathcal C) - f_{kk}^\ast \right] \tag{modularity} \]

$f_{kk}$ 는 within $C_k$ 에서의 observed connections.
$f_{kk}^\ast$는 random edge assignment의 몇몇 모델 이하에서의 $f_{kk}$의 기댓값.
- $f_{kk}^\ast$는 $f_{k+} \cdot f_{+k}$이며 각각 $f$의 k번째 rowsum과 colsum. 즉 $f_{ij}$를 entry로 하는 $K \times K$ 매트릭스가 만들어짐.

This choice corresponds to a model in which a graph is constructed to have the same degree distribution as $G$, but with edges otherwise placed at random, without respect to the underlying partition elements dictated by $C$.

In principle the optimization of the modularity requires a search over all possible partitions C, which is prohibitively expensive in networks of moderate size and larger. • A fast, greedy approach to optimization has been proposed, in the form of an agglomerative hierarchical clustering algorithm, and implemented in igraph as fastgreedy.community. • The result of this and related community detection methods in igraph is to produce an object of the class communities, which can then serve as input to various other functions.

Applying this method to the karate network,

kc = fastgreedy.community(karate)
length(kc)
## [1] 3
sizes(kc)
## Community sizes
##  1  2  3 
## 18 11  5
head(membership(kc))
##   Mr Hi Actor 2 Actor 3 Actor 4 Actor 5 Actor 6 
##       2       2       2       2       3       3

high modularity value nontrivial group ???

The largest community of 18 members is centered around the administrator (i.e., John A, vertex ID 34). • The second largest community of 11 members is centered around the head instructor (i.e., Mr Hi, vertex ID 1).

plot(kc, karate)

FIGURE 7.6: Partitioning of the Karate network obtained from hierarchical clustering

Figure 9: Partitioning of the Karate network obtained from hierarchical clustering

• Whether agglomerative or divisive, when used for network graph partitioning, hierarchical clustering methods actually produce, as the name indicates, an entire hierarchy of nested partitions of the graph, not just a single partition. • The resulting hierarchy typically is represented in the form of a tree, called a dendrogram.

library(ape)

dendPlot(kc, mode = "phylo")

FIGURE 7.7: The corresponding dendrogram for this partitioning

7.2.3.2 Spectral Partitioning

Calculate Laplacian Grpah

\[ L = \underbrace{D}_{\text{diagonal matrix with degree}}-\underbrace{A}_{\text{Adjacency Matrix}} \\ = \begin{bmatrix} degree(n_1) & & 0 \\ & \ddots & \\ 0 & & degree(n_r) \end{bmatrix} -A \]

Eigenvalue decomposition of $L$

\[ L = \underbrace{\begin{bmatrix} v_1 & \cdots & v_n\end{bmatrix}}_{eigenvector} \underbrace{\begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0& & \lambda_n \end{bmatrix}}_{eigenvalue} \begin{bmatrix} v_1 \\ \vdots \\ v_n\end{bmatrix} \]

sort eigenvalue $\lambda_1 > \lambda_2 > \cdots > \lambda_n$. Is $L$ full rank? No. Why? cause $L$ is a representation of a network. Smallest eigenvalue can be shown to be identically zero and its corresponding eigenvecotr 1. Then? # of component in a grpah is directly related to # of non-zero eigenvalue. Which means $\lambda_{N-1} \approx 0 \Rightarrow K=2$, $\lambda_{N-1} \approx \lambda_{N-2} \approx 0 \Rightarrow K=3$³
select $K$ eigenvector. $v_{n-1}, \cdots, v_{n-k}$.
apply $k$-means clustering to $k$ selected eigenvector.

spectral graph theory의 연구결과를 응용하여 그래프 $G$의 connectivity를 특정 매트릭스의 eigen-analysis와 연관짓는 것.

adjacency matrix $A$에 대한 그래프 $G$의 그래프 Laplacian 은 $L = D − A$이며, 이때 $D = diag[(D_{vv} = d_v)]$, $d_v$는 $G$의 entries of the degree sequences.

spectral graph theory의 결과를 통해 우리는 다음을 파악 가능.

그래프 $G$는 $K$ 개의 connected components로 구성 $\iff$ $\lambda_1 (L) = \cdots = \lambda_K(L) = 0$ 이며 $\lambda_{K+1}(L)>0$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{N_v}$들은 L의 (not necessarily distinct) eigenvalue이며, ordered from small to large.

그래프 안의 component의 숫자는 그래프 Laplacian의 non-zero eigenvalue의 숫자과 직접적으로 연관되어 있음. $L$의 최소 eigenvalue는 0임을 바로 보일 수 있다. eigenvector $x_1 = (1,\cdots,1)'$에 대응하므로. 따라서 우리가 그래프 $G$가 “거의” $K=2$ 개의 component들로 구성되어 있다고 추론한다면, 즉슨 2개로 쪼개기에 적합하다고 생각한다면, 이는 곧 우리는 해당 대상에 대해 $\lambda_2(L)$가 0에 가까울 것이라고 추론할 것이라는 것과 동치이다. 이러한 추론은 $\lambda_2$가 그래프 connectivity와 structure의 측정치의 값과 깊은 연관이 있기에 합리적이다. 특히 이러한 관계성은 $\lambda_2$가 0에 가까울 수록 서브그래프 A과 서브그래프 B 사이를 통과하는 edge가 적을 것이기에 이렇게 둘로 쪼개는 것이 합리적일 것임을 보여준다. $\lambda_2$를 그래프의 connectivity와 연관지은 제언자는 대응하는 eigenvector $x_2$ 안의 entries들의 부호에 따라 vertex들을 쪼개는 것을 주장했다. 결과는 다음과 같다:

\[ S = \{v \in V: x_2 (v) \ge 0 \} \\ \bar S = \{v \in V: x_2 (v) < 0 \} \]

즉, 2개의 vertex의 subset이 생산되며 (이를 보통 cut이라고 부름), 이 벡터 $x_2$는 보통 Fiedler Vector라고 불리며 이에 대응하는 eigenvalue $\lambda_2$는 Fiedler Value라고 부른다.

k.lap = graph.laplacian(karate)
eig.anal = eigen(k.lap)
plot(eig.anal$values, col = "blue", ylab = "Eigenvalues of Graph Laplacian")

FIGURE 7.8: Eigenvalues of Graph Laplacian

We plot the eigenvalues of the graph Laplacian.

0인 eigenvalue는 딱 하나. (해당 네트워크는 connected이므로 예상한 결과)
2번째로 작은 eigenvalue인 $\lambda_2$는 0에 매우 가까움.

f.vec = eig.anal$vectors[, 33]  #Extracting the Fiedler vector
faction = get.vertex.attribute(karate, "Faction")
f.colors = as.character(length(faction))
f.colors[faction == 1] = "red"
f.colors[faction == 2] = "cyan"
plot(f.vec, pch = 16, xlab = "Actor Number", ylab = "Fiedler Vector Entry",
    col = f.colors)
abline(0, 0, lwd = 2, col = "lightgray")

FIGURE 7.9: Fiedler vector and its corresponding partition

Fiedler vector를 생산하고 해당 vector의 요소들을 실제 actor number에 따라 배정한 그래프를 보면 이 spectral 방법이 faction label에 의해 네트워크 partitioning 을 획득할 수 있다는 것을 확인된다.

보통 우리는 네트워크가 서브그래프 2개보다는 더 잘게 쪼개질 수 있으리라고 예상 가능. spectral 방법을 iterative하게 적용하는 것으로 2개 이상으로 쪼갤 수 있음. 하지만 이러한 반복이 특정 목적 함수를 최적화할 수 있도록 목표하는 것이 바람직함. Newman은 spectral bisection method와 논리적 흐름이 유사하나 Laplacian $L$이 아니라 이를 대체해서 modularity와 연관된 매트릭스를 사용하는 방법을 제안했다.(leading.eigenvector.community)

7.2.3.3 Validation of Graph Partitioning

validation 문제는 그래프 partitioning에 항상 중요하지만, 대부분의 경우 nontrivial 문제이다. 네트워크 그래프에 vertex의 cohesive subset 이 존재한다면, 이러한 subset의 기저에는 vertex에게 있어 vertex 간에 특정한 연관적인 특성 (또는 속성)에 일부 공통성이 있을 것으로 일반적으로 예상한다. 그래프 partitioning은 이러한 성질에 대한 지식이 없을때 그러한 subset을 발견하기 위한 도구로 인식될 수도 있다. 우리가 그래프 외부에서 정의된 클래스 멤버쉽에 대한 subset 정의를 알고 있다면, 그래프 내부에서의 partitioning으로 얻은 분절들과 비교하는 것도 흥미로움.

func.class = get.vertex.attribute(yeast.gc, "Class")
table(func.class)
## func.class
##   A   B   C   D   E   F   G   M   O   P   R   T   U 
##  51  98 122 238  95 171  96 278 171 248  45 240 483

해당 예시는 cell 구축에 있어 protein이 역할하는 바로 분절했음. 단백질들이 서로 다른 단백질들과 얼마나 유사한지는 특정 세포 역할에 해당 단백질이 무슨 일을 하는지와 연관되어 있다고 알려져 있음. 그래프 외부에서 이러한 단백질들을 분류하려는 시도는 분류된 결과가 그래프 내부에서 합리적은 partitioning 과정을 걸쳐 나온 결과물과 어느정도는 연관이 있는게 맞다. 아니면 partitioning이 잘못됐던가 그래프 외부 분절이 잘못됐던가.

yc = fastgreedy.community(yeast.gc)
c.m = membership(yc)
head(table(c.m, func.class, useNA = c("no")))
##    func.class
## c.m   A   B   C   D   E   F   G   M   O   P   R   T   U
##   1   0   0   0   1   3   7   0   6   3 110   2  35  14
##   2   0   2   2   7   1   1   1   4  39   5   0   4  27
##   3   1   9   7  18   4   8   4  20  10  23   8  74  64
##   4  25  11  10  22  72  84  81 168  14  75  16  27 121
##   5   1   7   5  14   0   4   0   2   3   6   1  34  68
##   6   1  24   1   4   1   4   0   7   0   1   0  19  16

7.2.4 Assortativity and Mixing

Assortative mixing

특정 성질에 따라서 vertex 중에 선별적으로 연결.

Assortativity coefficients

assortative mixing의 정도를 량화하는 측도. 이는 correlation coefficients의 변용. vertex 특성은 categorical, ordinal, or continuous 다 가능. categorical 케이스를 가정하고, 그래프 $G$의 각 vertex가 $M$개의 카테고리 중에 label 될 수 있다고 생각하자. 이 세팅에서의 Assortativity coefficients $r_a$는 아래와 같다.

\[ r_a = \frac{\sum_{i}f_{ii} - \sum_i f_{x+}f_{+y}}{1 - \sum_if_{x+}f_{+y}} \]

where $f_{ij}$ is the fraction of edges in $G$ that join a vertex in the $i$-th category with a vertex in the jth category, and $f_{i+}$ and $f_{+i}$ denote the ith marginal row and column sums, respectively, of the resulting matrix $f$.

이때 $-1 \le r_a \le 1$

– It is equal to zero when the mixing in the graph is no different from that obtained through a random assignment of edges that preserves the marginal degree distribution. – It is equal to one when there is perfect assortative mixing (i.e., when edges only connect vertices of the same category). – Howeigenvalueer, in the eigenvalueent that the mixing is perfectly disassortative, in the sense that eigenvalueery edge in the graph connects vertices of two different categories, the coefficient need not take the value −1. • The fact that physical binding of proteins is known to be directly releigenvalueant to functional classes suggests that there will frequently be strong assortative mixing in protein-protein interaction networks with respective to these classes as attributes.

assortativity.nominal(yeast, (replace(V(yeast)$Class, is.na(V(yeast)$Class),
    0) == "P") + 1, directed = FALSE)
## [1] 0.5232879
assortativity.degree(yeast)
## [1] 0.4610798

• When the vertex characteristic of interest is continuous, rather than discrete, denote by (xe, ye) the values of that characteristic for the vertices joined by an edge e ∈ E. • A natural candidate for quantifying the assortativity in this characteristic is just th e Pearson correlation coefficient of the pairs (xe, ye),

\[ r = \frac{\sum_{x,y}xy(f_{xy} - f_{x+}f_{+y})}{\sigma_x \sigma_y} \]